# Sarvam-M 4-bit Quantized
This is a 4-bit quantized version of sarvamai/sarvam-m using BitsAndBytesConfig with NF4 quantization.
## Model Details
- Base Model: sarvamai/sarvam-m
- License: Apache 2.0
- Quantization Method: BitsAndBytes 4-bit NF4
- Compute dtype: bfloat16
- Double Quantization: Enabled
- Size Reduction: ~75% smaller than original model (14GB vs ~70GB)
- Memory Usage: requires roughly a quarter of the GPU memory of the full-precision model
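The settings above correspond to a `BitsAndBytesConfig` like the one sketched below. This is an illustrative reconstruction based on the listed settings, not the exact script used to produce this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative reconstruction of the quantization settings listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute dtype
    bnb_4bit_use_double_quant=True,         # double quantization enabled
)

# Quantize the original model on the fly, then save or push the result
model = AutoModelForCausalLM.from_pretrained(
    "sarvamai/sarvam-m",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-m")
# model.push_to_hub("sarvam-m-bnb-4bit")
# tokenizer.push_to_hub("sarvam-m-bnb-4bit")
```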
## Key Features
- Efficient Inference: Significantly reduced memory footprint
- Thinking Mode: Supports reasoning via the `enable_thinking` parameter
- Chat Template: Optimized for conversational AI applications
- Device Mapping: Automatic device placement for multi-GPU setups
## Installation

```bash
pip install transformers torch accelerate bitsandbytes
```
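4-bit inference with bitsandbytes requires a CUDA-capable GPU. An optional sanity check along these lines can confirm the environment before loading the model:

```python
# Optional: verify the dependencies import and a CUDA device is visible
import torch
import transformers
import bitsandbytes

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```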
## Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tarun7r/sarvam-m-bnb-4bit"

# Load the tokenizer and the quantized model
# (the 4-bit settings are stored in the checkpoint's config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# Prepare input
prompt = "Who are you and what is your purpose?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True,  # Enable reasoning mode
)

# Generate response
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)

# Drop the end-of-sequence marker, then separate reasoning from the answer
if output_text.endswith("</s>"):
    output_text = output_text[: -len("</s>")]

if "</think>" in output_text:
    reasoning_content, content = output_text.split("</think>", 1)
    print("Reasoning:", reasoning_content.strip("\n"))
    print("Response:", content.strip("\n"))
else:
    print("Response:", output_text.strip("\n"))
```
### Advanced Usage with Custom Parameters
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "tarun7r/sarvam-m-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Optional: explicit quantization config matching the settings this
# checkpoint was quantized with
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # Optional
    device_map="auto",
    torch_dtype="auto",
)

# Generate with custom sampling parameters
# (model_inputs is prepared as in the basic example above)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
```
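For interactive use, tokens can also be streamed to the console as they are generated. The sketch below uses transformers' `TextStreamer` and assumes `model`, `tokenizer`, and `model_inputs` from the examples above.

```python
from transformers import TextStreamer

# Print tokens as they are generated instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **model_inputs,
    max_new_tokens=512,
    streamer=streamer,
)
```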
## Thinking Mode
The model supports two modes:
- Thinking Mode (`enable_thinking=True`): the model shows its reasoning process before answering
- Direct Mode (`enable_thinking=False`): the model answers directly, without a reasoning trace
```python
# Enable thinking mode for complex reasoning
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True,
)

# Disable thinking for quicker, direct responses
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False,
)
```
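Generation itself is identical in both modes; only the template call changes. As a rough rule of thumb (not a measured benchmark), thinking mode spends extra output tokens and latency in exchange for better multi-step reasoning. A minimal direct-mode round trip, reusing `model`, `tokenizer`, and `messages` from above:

```python
# End-to-end direct-mode generation: a direct answer without a reasoning trace
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=256)
response = tokenizer.decode(
    generated_ids[0][len(model_inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print(response)
```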
## Performance Comparison

| Model Version   | Size  | GPU Memory | Loading Time |
|-----------------|-------|------------|--------------|
| Original        | ~70GB | ~70GB VRAM | ~5-10 min    |
| 4-bit Quantized | ~14GB | ~18GB VRAM | ~1-2 min     |
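These figures are approximate and vary with hardware, context length, and generation settings. The snippet below is one way to check the footprint on your own setup: `get_memory_footprint` reports parameter and buffer memory, while the peak allocation also includes activations and the KV cache.

```python
import torch

# Parameter and buffer memory of the loaded (quantized) model
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Peak GPU allocation observed so far (run after a generation call)
if torch.cuda.is_available():
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```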
## Requirements
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.35+
- BitsAndBytes 0.41+
- CUDA-compatible GPU (recommended)
## Limitations

- Slight performance degradation compared to the full-precision model
- Requires BitsAndBytes library for loading
- May have minor numerical differences in outputs
## License
Apache 2.0 (same as original model)
## Attribution

- Original Model: Sarvam AI
- Quantization: Created using the BitsAndBytes library
- Base Model License: Apache 2.0
## Disclaimer
This is an unofficial quantized version. For the original model and official support, please refer to sarvamai/sarvam-m.