Native Sanskrit-English Tokenizer for Qwen2.5

Problem Statement

The original Qwen2.5 tokenizer produces inefficient byte-level tokens for Sanskrit text:

  • Qwen's output: ['ह', 'र', 'à¥ĩ', 'Ġà¤ķ', 'à¥', 'ĥ', 'ष', 'à¥įà¤', '£'] (36 tokens)
  • Our output: ['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे'] (8 tokens)

On this example, this tokenizer is 4.5x more efficient (8 tokens vs. 36) and produces readable, meaningful tokens instead of byte-level fragments.
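The efficiency figure follows directly from the token counts in the example above; a minimal sketch of the arithmetic:

```python
# Token counts from the example above: Qwen2.5's byte-level BPE vs. this tokenizer.
qwen_token_count = 36
our_token_count = 8

# Efficiency gain = how many times fewer tokens are needed for the same text.
efficiency = qwen_token_count / our_token_count
print(efficiency)  # 4.5
```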

Usage

from transformers import AutoTokenizer

# Load tokenizer (native Hugging Face format)
tokenizer = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")

# Test Sanskrit tokenization
text = "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']

# Perfect reconstruction
decoded = tokenizer.decode(tokenizer.encode(text))
print(decoded)  # "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"

# Chat template support
messages = [{'role': 'user', 'content': 'What is the meaning of हरे कृष्ण?'}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

Performance Comparison

Tokenizer   Tokens   Readable   Efficiency    Format
Ours        8        YES        4.5x better   Native HF
Qwen        36       NO         Poor          ByteLevel BPE

Training with Axolotl

# qwen.yaml
base_model: Qwen/Qwen2.5-1.5B
tokenizer_config: diabolic6045/Sanskrit-English-qwen2-tokenizer
resize_token_embeddings_to_32x: true

# Start training (shell command, not part of qwen.yaml):
accelerate launch -m axolotl.cli.train qwen.yaml
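The `resize_token_embeddings_to_32x` flag tells Axolotl to pad the model's embedding matrix so the vocabulary size is a multiple of 32, a common alignment for GPU efficiency. A minimal sketch of the rounding involved; the commented `transformers` call at the end shows how the same resize can be done directly (the `pad_to_multiple_of` argument is the library's own mechanism; the model name is taken from the config above):

```python
def round_up_to_multiple(n: int, multiple: int = 32) -> int:
    """Round a vocabulary size up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

# This tokenizer's 120,000-entry vocab is already 32-aligned (120000 = 3750 * 32),
# so the resize is a no-op here, but it matters for arbitrary vocab sizes:
print(round_up_to_multiple(120_000))  # 120000
print(round_up_to_multiple(120_001))  # 120032

# Equivalent resize done directly in transformers (sketch, not run here):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
# model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
```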

Key Features

  • Native Hugging Face Format - No custom code needed
  • 120,000-token vocabulary trained on a large English+Sanskrit corpus
  • Clean, readable tokens - no more byte-level artifacts
  • 4.5x more efficient than Qwen's original tokenizer
  • Official Qwen chat template - ready for inference
  • Personalized identity - "Created by Divax Shah (diabolic6045)"
  • Axolotl compatible - works seamlessly with distributed training

Training Pipeline

  1. Base Model Training - Train on Sanskrit text completion
  2. Instruct Tuning - Add chat capabilities with proper formatting
  3. Deployment - Use for Sanskrit-English applications
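For step 2, instruct-tuning examples are rendered with the official Qwen chat template, which is based on the ChatML format. A minimal sketch of the structure that template produces, built by hand so the markup is visible (the template shipped with the tokenizer via `apply_chat_template` is authoritative; this is a simplified reconstruction covering only the role/content wrapping):

```python
def chatml_format(messages, add_generation_prompt=True):
    """Render messages in ChatML, the format Qwen's chat template is based on."""
    out = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> delimiters.
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference.
        out.append("<|im_start|>assistant\n")
    return "".join(out)

messages = [{"role": "user", "content": "What is the meaning of हरे कृष्ण?"}]
print(chatml_format(messages))
```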

Technical Details: TECHNICAL_README.md
