# Nepali Transliteration Model

## Model Description
This model performs bidirectional transliteration between Nepali (Devanagari script) and English (Latin script). It can convert:
- English text to Nepali Devanagari script
- Nepali Devanagari text to English romanization
The model is fine-tuned for accurate transliteration of Nepali names, places, and common vocabulary.
## Model Details
- Model Type: Sequence-to-sequence text generation
- Language(s): Nepali (ne), English (en)
- License: Apache 2.0
- Base Model: [Specify your base model, e.g., T5, mT5, etc.]
- Training Data: Custom Nepali-English transliteration dataset
- Training Steps: [Update with actual number]
- Parameters: [Update with model size]
## Intended Use

### Primary Use Cases
- Converting English names and words to Nepali Devanagari script
- Romanizing Nepali text for international audiences
- Supporting multilingual applications and keyboards
- Academic research in computational linguistics
- Cultural preservation and digital humanities projects
### Out-of-Scope Use Cases
- Machine translation (this model only handles transliteration, not translation)
- Text generation beyond transliteration
- Processing languages other than Nepali and English
## How to Use

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "nirajan111/nepali-transliteration"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# English to Nepali
def transliterate_en_to_ne(text):
    inputs = tokenizer(f"en2ne: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Nepali to English
def transliterate_ne_to_en(text):
    inputs = tokenizer(f"ne2en: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(transliterate_en_to_ne("namaste"))   # Expected: नमस्ते
print(transliterate_ne_to_en("काठमाडौं"))  # Expected: kathmandu
```
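As an alternative to calling `generate` directly, seq2seq checkpoints like this one can usually be driven through the `text2text-generation` pipeline. A minimal sketch, assuming `model_name` is set as above:

```python
from transformers import pipeline

# Convenience wrapper around the same checkpoint
translit = pipeline("text2text-generation", model=model_name)
print(translit("en2ne: namaste", max_length=128)[0]["generated_text"])
```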
### Advanced Usage

```python
# Batch processing
texts = ["namaste", "dhanyabad", "kathmandu"]
inputs = tokenizer([f"en2ne: {text}" for text in texts],
                   return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
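For larger batches, moving the model to a GPU typically speeds up inference considerably. A minimal sketch, assuming CUDA is available and `model`/`tokenizer` are loaded as above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

texts = ["namaste", "dhanyabad", "kathmandu"]
inputs = tokenizer([f"en2ne: {t}" for t in texts],
                   return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```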
## Training Data
The model was trained on a custom dataset containing:
- Size: [Update with dataset size, e.g., 50,000 transliteration pairs]
- Sources:
  - Nepali names and places
  - Common vocabulary
  - Cultural terms
  - Government documents
  - Educational materials
- Preprocessing: Text normalization, duplicate removal, quality filtering (see the sketch after this list)
- Split: 80% training, 10% validation, 10% testing
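The exact preprocessing pipeline is not published with this card. The sketch below illustrates the listed steps (normalization, deduplication, filtering, and the 80/10/10 split) under the assumption that the raw data is a list of `(latin, devanagari)` string pairs; the function names are illustrative only:

```python
import random
import unicodedata

def preprocess(pairs):
    """Normalize, deduplicate, and quality-filter (latin, devanagari) pairs."""
    seen, cleaned = set(), []
    for latin, devanagari in pairs:
        latin = latin.strip().lower()
        devanagari = unicodedata.normalize("NFC", devanagari.strip())
        if not latin or not devanagari:
            continue                      # quality filter: drop empty entries
        if (latin, devanagari) in seen:
            continue                      # duplicate removal
        seen.add((latin, devanagari))
        cleaned.append((latin, devanagari))
    return cleaned

def split_dataset(pairs, seed=42):
    """Shuffle and split 80/10/10 into train/validation/test."""
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    return pairs[:int(0.8 * n)], pairs[int(0.8 * n):int(0.9 * n)], pairs[int(0.9 * n):]
```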
## Training Procedure

### Training Hyperparameters
- Batch Size: 64 (training), 16 (evaluation)
- Learning Rate: [Update with actual value]
- Epochs: 10
- Optimizer: AdamW
- Weight Decay: 0.01
- Warmup Steps: 500
- Max Sequence Length: 128
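With the Hugging Face `Trainer` API (AdamW is its default optimizer), these settings would map onto `Seq2SeqTrainingArguments` roughly as below; `output_dir` and `learning_rate` are placeholders, since the card leaves them unspecified:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./nepali-transliteration",  # placeholder path
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,                     # placeholder; actual value not published
    num_train_epochs=10,
    weight_decay=0.01,
    warmup_steps=500,
    predict_with_generate=True,             # decode with generate() during evaluation
    generation_max_length=128,
)
```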
### Training Infrastructure
- Hardware: [Update with your setup, e.g., Tesla V100, A100]
- Framework: PyTorch, Transformers
- Training Time: [Update with actual time]
## Evaluation

### Metrics
- BLEU Score: 0.85 (update with actual)
- Word Accuracy: 0.92 (update with actual)
- Character Error Rate: 0.08 (update with actual)
- Exact Match: 0.78 (update with actual)
### Test Results
| Direction | CER |
|---|---|
| EN → NE | 0.13 |
| NE → EN | 0.10 |
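For reference, exact match and CER (character-level Levenshtein distance divided by reference length) can be computed with no extra dependencies. A minimal sketch:

```python
def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (h != r)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

def exact_match(hypotheses, references):
    """Fraction of predictions that match the reference exactly."""
    pairs = list(zip(hypotheses, references))
    return sum(h == r for h, r in pairs) / max(len(pairs), 1)

print(cer("kathmandu", "kathmandu"))      # 0.0
print(exact_match(["नमस्ते"], ["नमस्ते"]))  # 1.0
```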
## Limitations and Bias

### Known Limitations
- Performance may vary with proper nouns not seen during training
- Limited handling of mixed-script text
- May struggle with very long compound words
- Accuracy depends on text quality and standardization
### Potential Biases
- Training data may over-represent certain regions or dialects of Nepali
- Model may have better performance on formal/literary Nepali vs. colloquial forms
- Potential bias toward more common transliteration patterns
## Ethical Considerations
- This model supports language preservation and digital inclusion for Nepali speakers
- Care should be taken when using for official documents or names
- Users should verify outputs for critical applications
- The model should not be used to misrepresent or appropriate Nepali culture
## Citation

```bibtex
@misc{nepali-transliteration-2025,
  title={Nepali Transliteration Model},
  author={Nirajan Sah},
  year={2025},
  url={https://huggingface.co/nirajan1111/nepali-transliteration-model}
}
```
## Model Card Contact
For questions or feedback about this model, please contact: [[email protected]]
## Acknowledgments
- Thanks to the Nepali language community for providing linguistic insights
- [Add any other acknowledgments]