The fully-trained version of this model is now available at https://huggingface.co/sarvamai/sarvam-1

Update (Aug 15, 2024): You can now get started with text completions and supervised fine-tuning using this notebook on Google Colab!

This is an early checkpoint of sarvam-2b, a small yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to perform well in 10 Indic languages plus English. The officially supported Indic languages are Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The final checkpoint of sarvam-2b will be released soon. It will be trained on a data mixture of 4 trillion tokens, split equally between English (2T) and Indic (2T) tokens.

The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in this video.

The model was trained with NVIDIA NeMo™ Framework on the Yotta Shakti Cloud using HGX H100 systems.

Getting started:

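from transformers import pipeline

# Load the checkpoint as a text-generation pipeline on the first GPU (device=0).
pipe = pipeline('text-generation', model='sarvamai/sarvam-2b-v0.5', device=0)
# Complete a Hindi prompt ("India's first Prime Minister"); the call returns a list of dicts.
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'  ("India's first Prime Minister was Jawaharlal Nehru.")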
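
The call above passes temperature=0.1 and repetition_penalty=1.2. As a rough illustration of what these two settings do to next-token scores, here is a minimal pure-Python sketch. It mirrors the commonly used CTRL-style repetition penalty and temperature scaling; the helper names and the toy scores are hypothetical, and the exact order and conditions under which a library applies these steps (for example, temperature generally only matters when sampling is enabled) may differ.

```python
import math

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Down-weight tokens that already appeared (CTRL-style penalty)."""
    out = list(logits)
    for tok in set(seen_token_ids):
        # Positive scores are divided and negative scores multiplied,
        # so a seen token always becomes less likely when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def softmax_with_temperature(logits, temperature=0.1):
    """Turn scores into probabilities; a low temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 token ids with made-up scores (illustrative only).
toy_logits = [2.0, 1.9, -0.5, 0.3]
# Pretend token 0 was already generated, so it gets penalized.
penalized = apply_repetition_penalty(toy_logits, seen_token_ids=[0], penalty=1.2)
probs_sharp = softmax_with_temperature(penalized, temperature=0.1)
probs_flat = softmax_with_temperature(penalized, temperature=1.0)
```

In this toy case the penalty pushes token 0 below token 1, and the low temperature concentrates most of the probability mass on token 1, which is why low-temperature output tends to be nearly deterministic.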

Tokenizer

sarvam-2b's tokenizer is built to be efficient for Indic languages. Its average fertility score (the average number of tokens produced per word) is about 2, which is significantly lower than that of the other models compared below.

Here is a comparison of fertility scores between sarvam-2b and other popular models.

| Language | Sarvam-2B | Llama-3.1 | Gemma-2 | GPT-4o |
|---|---|---|---|---|
| ben_Beng | 2.07 | 8.02 | 3.72 | 2.34 |
| eng_Latn | 1.43 | 1.24 | 1.23 | 1.23 |
| guj_Gujr | 1.81 | 9.97 | 3.9 | 2.3 |
| hin_Deva | 1.4 | 2.67 | 1.96 | 1.65 |
| kan_Knda | 2.37 | 14.95 | 5.55 | 3.29 |
| mal_Mlym | 2.85 | 16.26 | 5.88 | 3.52 |
| mar_Deva | 1.77 | 3.99 | 3.2 | 2.56 |
| ory_Orya | 2.35 | 16.84 | 6.87 | 6.83 |
| pan_Guru | 1.68 | 8.19 | 3.37 | 2.72 |
| tam_Taml | 2.17 | 12.39 | 4.19 | 3.17 |
| tel_Telu | 2.14 | 13.3 | 4.57 | 3.06 |
| Average | 2.08 | 9.34 | 4.01 | 3.00 |
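
For readers unfamiliar with the metric, fertility is usually computed as the number of tokens divided by the number of words in a text sample. The sketch below shows that calculation under stated assumptions: words are split on whitespace, and `toy_tokenize` is a hypothetical stand-in tokenizer. The corpus and word segmentation behind the scores in the table are not specified here, so this is not a reproduction of those numbers.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words

def toy_tokenize(text, chunk=3):
    """Hypothetical tokenizer: cut each word into pieces of at most `chunk` characters."""
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]

sample = ["small language model", "tokenizer"]
score = fertility(sample, toy_tokenize)  # 10 tokens over 4 words
```

To try this with a real tokenizer, one option is to pass the `tokenize` method of a tokenizer loaded through `transformers.AutoTokenizer` as the `tokenize` argument. A lower score means fewer tokens per word, so more text fits in the same context length.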

More technical details like evaluations and benchmarking will be posted soon.

Model size: 2.51B params (Safetensors). Tensor type: BF16.
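
As a back-of-the-envelope illustration of what the parameter count and tensor type above imply, the sketch below estimates the memory needed for the weights alone. It assumes 2 bytes per BF16 value and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_bytes(num_params, bytes_per_param):
    """Memory needed to hold the raw weights, with no runtime overhead."""
    return num_params * bytes_per_param

BF16_BYTES = 2   # bfloat16 stores each value in 16 bits
params = 2.51e9  # parameter count listed on the model page

total = weight_memory_bytes(params, BF16_BYTES)
gb = total / 1e9     # decimal gigabytes
gib = total / 2**30  # binary gibibytes
print(f"{gb:.2f} GB ({gib:.2f} GiB) for weights only")
```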