# Sarvam-M 4-bit Quantized
This is a 4-bit quantized version of sarvamai/sarvam-m using BitsAndBytesConfig with NF4 quantization.
## Model Details
- Base Model: sarvamai/sarvam-m
- License: Apache 2.0
- Quantization Method: BitsAndBytes 4-bit NF4
- Compute dtype: bfloat16
- Double Quantization: Enabled
- Size Reduction: ~75% smaller than original model (14GB vs ~70GB)
- Memory Usage: requires roughly a quarter of the GPU memory of the full-precision model
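The settings above correspond to a `BitsAndBytesConfig` like the one sketched below. This is an illustrative reconstruction based on the listed settings, not the exact script used to produce this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative reconstruction of the quantization settings listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute dtype
    bnb_4bit_use_double_quant=True,         # double quantization enabled
)

# Quantize the original model on the fly, then save or push the result
model = AutoModelForCausalLM.from_pretrained(
    "sarvamai/sarvam-m",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-m")
# model.push_to_hub("sarvam-m-bnb-4bit")
# tokenizer.push_to_hub("sarvam-m-bnb-4bit")
```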
## Key Features
- Efficient Inference: Significantly reduced memory footprint
- Thinking Mode: Supports reasoning via the `enable_thinking` parameter
- Chat Template: Optimized for conversational AI applications
- Device Mapping: Automatic device placement for multi-GPU setups
## Installation

```bash
pip install transformers torch accelerate bitsandbytes
```
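4-bit inference with bitsandbytes requires a CUDA-capable GPU. An optional sanity check along these lines can confirm the environment before loading the model:

```python
# Optional: verify the dependencies import and a CUDA device is visible
import torch
import transformers
import bitsandbytes

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```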
## Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tarun7r/sarvam-m-bnb-4bit"

# Load the tokenizer and the quantized model
# (the 4-bit settings are stored in the checkpoint's config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# Prepare input
prompt = "Who are you and what is your purpose?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True,  # Enable reasoning mode
)

# Generate response
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)

# Drop the end-of-sequence marker, then separate reasoning from the answer
if output_text.endswith("</s>"):
    output_text = output_text[: -len("</s>")]

if "</think>" in output_text:
    reasoning_content, content = output_text.split("</think>", 1)
    print("Reasoning:", reasoning_content.strip("\n"))
    print("Response:", content.strip("\n"))
else:
    print("Response:", output_text.strip("\n"))
```
### Advanced Usage with Custom Parameters
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "tarun7r/sarvam-m-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Optional: explicit quantization config matching the settings this
# checkpoint was quantized with
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # Optional
    device_map="auto",
    torch_dtype="auto",
)

# Generate with custom sampling parameters
# (model_inputs is prepared as in the basic example above)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
```
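For interactive use, tokens can also be streamed to the console as they are generated. The sketch below uses transformers' `TextStreamer` and assumes `model`, `tokenizer`, and `model_inputs` from the examples above.

```python
from transformers import TextStreamer

# Print tokens as they are generated instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **model_inputs,
    max_new_tokens=512,
    streamer=streamer,
)
```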
## Thinking Mode
The model supports two modes:
- Thinking Mode (`enable_thinking=True`): the model shows its reasoning process before answering
- Direct Mode (`enable_thinking=False`): the model answers directly, without a reasoning trace
```python
# Enable thinking mode for complex reasoning
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True,
)

# Disable thinking for quicker, direct responses
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False,
)
```
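Generation itself is identical in both modes; only the template call changes. As a rough rule of thumb (not a measured benchmark), thinking mode spends extra output tokens and latency in exchange for better multi-step reasoning. A minimal direct-mode round trip, reusing `model`, `tokenizer`, and `messages` from above:

```python
# End-to-end direct-mode generation: a direct answer without a reasoning trace
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=256)
response = tokenizer.decode(
    generated_ids[0][len(model_inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print(response)
```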
## Performance Comparison

| Model Version   | Size  | GPU Memory | Loading Time |
|-----------------|-------|------------|--------------|
| Original        | ~70GB | ~70GB VRAM | ~5-10 min    |
| 4-bit Quantized | ~14GB | ~18GB VRAM | ~1-2 min     |
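These figures are approximate and vary with hardware, context length, and generation settings. The snippet below is one way to check the footprint on your own setup: `get_memory_footprint` reports parameter and buffer memory, while the peak allocation also includes activations and the KV cache.

```python
import torch

# Parameter and buffer memory of the loaded (quantized) model
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Peak GPU allocation observed so far (run after a generation call)
if torch.cuda.is_available():
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```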
## Requirements
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.35+
- BitsAndBytes 0.41+
- CUDA-compatible GPU (recommended)
## Limitations

- Slight performance degradation compared to the full-precision model
- Requires BitsAndBytes library for loading
- May have minor numerical differences in outputs
## License
Apache 2.0 (same as original model)
## Attribution

- Original Model: Sarvam AI
- Quantization: Created using the BitsAndBytes library
- Base Model License: Apache 2.0
## Disclaimer
This is an unofficial quantized version. For the original model and official support, please refer to sarvamai/sarvam-m.