Sarvam-M 4-bit Quantized

This is a 4-bit quantized version of sarvamai/sarvam-m using BitsAndBytesConfig with NF4 quantization.

Model Details

  • Base Model: sarvamai/sarvam-m
  • License: Apache 2.0
  • Quantization Method: BitsAndBytes 4-bit NF4
  • Compute dtype: bfloat16
  • Double Quantization: Enabled
  • Size Reduction: ~80% smaller on disk than the original model (~14GB vs ~70GB)
  • Memory Usage: ~4x less GPU memory required
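
The settings above map to a BitsAndBytesConfig like the one below. This is a minimal sketch of the assumed quantization workflow (the exact script used to produce this checkpoint is not published; the output path is illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype: bfloat16
    bnb_4bit_use_double_quant=True,         # double quantization
)

# Load the base model with on-the-fly 4-bit quantization ...
model = AutoModelForCausalLM.from_pretrained(
    "sarvamai/sarvam-m",
    quantization_config=bnb_config,
    device_map="auto",
)

# ... and serialize the 4-bit weights (needs a recent bitsandbytes release)
model.save_pretrained("sarvam-m-bnb-4bit")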

Key Features

  • Efficient Inference: Significantly reduced memory footprint
  • Thinking Mode: Supports reasoning capabilities with enable_thinking parameter
  • Chat Template: Optimized for conversational AI applications
  • Device Mapping: Automatic device placement for multi-GPU setups (see the sketch below)
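
For the device-mapping point above, from_pretrained also accepts an accelerate-style max_memory budget alongside device_map="auto". A minimal sketch (the GiB budgets are placeholders, not tuned values):

from transformers import AutoModelForCausalLM

# Cap per-device memory so accelerate spreads layers across GPUs and CPU;
# adjust the budgets to your hardware.
model = AutoModelForCausalLM.from_pretrained(
    "tarun7r/sarvam-m-bnb-4bit",
    device_map="auto",
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},
)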

Installation

pip install transformers torch accelerate bitsandbytes
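
A quick sanity check that the stack is in place (printed versions will vary):

import torch, transformers, bitsandbytes

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())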

Usage

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tarun7r/sarvam-m-bnb-4bit"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)

# Prepare input
prompt = "Who are you and what is your purpose?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True  # Enable reasoning mode
)

# Generate response
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)

# Parse thinking and response
# (note: rstrip("</s>") would strip a *set of characters*, not the literal
#  token, so the end-of-sequence marker is removed explicitly instead)
if "</think>" in output_text:
    reasoning_content, _, content = output_text.partition("</think>")
    reasoning_content = reasoning_content.rstrip("\n")
    content = content.lstrip("\n")
    if content.endswith("</s>"):
        content = content[: -len("</s>")]
    print("Reasoning:", reasoning_content)
    print("Response:", content)
else:
    content = output_text
    if content.endswith("</s>"):
        content = content[: -len("</s>")]
    print("Response:", content)

Advanced Usage with Custom Parameters

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Optional: explicit quantization config (the checkpoint already stores one,
# so transformers will typically use the saved config and warn)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # Optional
    device_map="auto",
    torch_dtype="auto"
)

# Generate with custom parameters
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

Thinking Mode

The model supports two modes:

  1. Thinking Mode (enable_thinking=True): Model shows reasoning process
  2. Direct Mode (enable_thinking=False): Direct response without reasoning

# Enable thinking mode for complex reasoning
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True
)

# Disable for quick responses
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
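
Since the two calls differ only in the flag, it can be convenient to wrap generation and parsing together. A hypothetical helper, built from the snippets above (chat is not part of the model's API):

def chat(model, tokenizer, prompt, thinking=True, max_new_tokens=512):
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, enable_thinking=thinking
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    output_text = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:])
    # Split the optional reasoning block from the final answer
    reasoning, sep, answer = output_text.partition("</think>")
    if not sep:  # no reasoning block emitted
        reasoning, answer = "", output_text
    answer = answer.strip()
    if answer.endswith("</s>"):  # drop the end-of-sequence token
        answer = answer[: -len("</s>")]
    return reasoning.strip(), answer

reasoning, answer = chat(model, tokenizer, "Why is the sky blue?")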

Performance Comparison

Model Version    | Size  | GPU Memory | Loading Time
-----------------|-------|------------|-------------
Original         | ~70GB | ~70GB VRAM | ~5-10 min
4-bit Quantized  | ~14GB | ~18GB VRAM | ~1-2 min
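
The figures above are approximate and depend on GPU, driver, and context length; actual allocation after loading can be checked with:

import torch

print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")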

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.35+
  • BitsAndBytes 0.41+
  • CUDA-compatible GPU (recommended)

Limitations

  • Slight output-quality degradation compared to the full-precision model
  • Requires the BitsAndBytes library for loading
  • Outputs may show minor numerical differences from the original model

License

Apache 2.0 (same as the original model)

Attribution

  • Original Model: Sarvam AI
  • Quantization: Created with the BitsAndBytes library
  • Base Model License: Apache 2.0

Disclaimer

This is an unofficial quantized version. For the original model and official support, please refer to sarvamai/sarvam-m.
