# Sanskrit LLMs

Projects I built to make LLMs better at Sanskrit.
The original Qwen2.5 tokenizer produces inefficient byte-level tokens for Sanskrit text:

```
['ह', 'र', 'à¥ĩ', 'Ġà¤ķ', 'à¥', 'ĥ', 'ष', 'à¥įà¤', '£', ...]  (36 tokens)
```

This tokenizer produces readable, meaningful tokens instead:

```
['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']  (8 tokens)
```

That is 4.5x better efficiency (36 tokens down to 8), with every token a readable, meaningful unit.
```python
from transformers import AutoTokenizer

# Load tokenizer (native Hugging Face format)
tokenizer = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")

# Test Sanskrit tokenization
text = "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']

# Perfect reconstruction
decoded = tokenizer.decode(tokenizer.encode(text))
print(decoded)  # "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"

# Chat template support
messages = [{'role': 'user', 'content': 'What is the meaning of हरे कृष्ण?'}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)
```
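The perfect-reconstruction claim is easy to check on more than one string. A minimal sketch; the sample sentences are arbitrary picks of ours, and if the tokenizer is truly lossless the asserts pass:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")

# Arbitrary Sanskrit and mixed Sanskrit-English samples (not from the card).
samples = [
    "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे",
    "धर्मक्षेत्रे कुरुक्षेत्रे",
    "नमस्ते, this sentence mixes Sanskrit and English.",
]
for s in samples:
    assert tokenizer.decode(tokenizer.encode(s)) == s, f"lossy round-trip: {s}"
print("All round-trips are lossless.")
```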
Side by side on the example sentence above:

| Tokenizer | Tokens | Readable | Efficiency | Format |
|---|---|---|---|---|
| Ours | 8 | Yes | 4.5x better | Native HF |
| Qwen2.5 (original) | 36 | No | 1x (baseline) | Byte-level BPE |
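The numbers in the table can be reproduced directly. A minimal sketch, assuming the original tokenizer is fetched from the `Qwen/Qwen2.5-1.5B` repo:

```python
from transformers import AutoTokenizer

text = "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"

original = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
ours = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")

n_original = len(original.tokenize(text))  # 36 on this sentence
n_ours = len(ours.tokenize(text))          # 8 on this sentence
print(f"original: {n_original} tokens, ours: {n_ours} tokens, "
      f"gain: {n_original / n_ours:.1f}x")
```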
To fine-tune Qwen2.5 with this tokenizer using axolotl:

```yaml
# qwen.yaml
base_model: Qwen/Qwen2.5-1.5B
tokenizer_config: diabolic6045/Sanskrit-English-qwen2-tokenizer
resize_token_embeddings_to_32x: true
```

```bash
# Start training
accelerate launch -m axolotl.cli.train qwen.yaml
```
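Outside of axolotl, the `resize_token_embeddings_to_32x` step can be reproduced by hand. A minimal sketch, assuming the base model fits in memory; `pad_to_multiple_of` is the standard `transformers` argument for this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

# Resize the embedding matrix to the new vocabulary, padded up to a
# multiple of 32 so the matrix shape stays friendly to tensor-core kernels.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
print(model.get_input_embeddings().weight.shape)
```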