Nepali Transliteration Model

Model Description

This model performs bidirectional transliteration between Nepali (Devanagari script) and English (Latin script). It can convert:

  • English text to Nepali Devanagari script
  • Nepali Devanagari text to English romanization

The model is fine-tuned for accurate transliteration of Nepali names, places, and common vocabulary.

Model Details

  • Model Type: Sequence-to-sequence text generation
  • Language(s): Nepali (ne), English (en)
  • License: Apache 2.0
  • Base Model: [Specify your base model, e.g., T5, mT5, etc.]
  • Training Data: Custom Nepali-English transliteration dataset
  • Training Steps: [Update with actual number]
  • Parameters: ~300M (F32, safetensors)

Intended Use

Primary Use Cases

  • Converting English names and words to Nepali Devanagari script
  • Romanizing Nepali text for international audiences
  • Supporting multilingual applications and keyboards
  • Academic research in computational linguistics
  • Cultural preservation and digital humanities projects

Out-of-Scope Use Cases

  • Machine translation (this model only handles transliteration, not translation)
  • Text generation beyond transliteration
  • Processing languages other than Nepali and English

How to Use

Installation

pip install transformers torch
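
If the base checkpoint is a T5/mT5 variant (see Model Details), the tokenizer may also need the SentencePiece backend:

pip install sentencepiece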

Basic Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "nirajan111/nepali-transliteration"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# English to Nepali
def transliterate_en_to_ne(text):
    inputs = tokenizer(f"en2ne: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Nepali to English
def transliterate_ne_to_en(text):
    inputs = tokenizer(f"ne2en: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(transliterate_en_to_ne("namaste"))  # Expected: नमस्ते
print(transliterate_ne_to_en("काठमाडौं"))  # Expected: kathmandu
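
Inference is faster on a GPU when one is available. A minimal sketch that moves the model and inputs to the detected device (the combined helper below is illustrative, not part of the model's published API):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def transliterate(text, direction="en2ne"):
    # direction is the task prefix: "en2ne" or "ne2en"
    inputs = tokenizer(f"{direction}: {text}", return_tensors="pt",
                       max_length=128, truncation=True).to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)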

Advanced Usage

# Batch processing
texts = ["namaste", "dhanyabad", "kathmandu"]
inputs = tokenizer([f"en2ne: {text}" for text in texts], 
                  return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
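
The same calls can also be wrapped in a text2text-generation pipeline (a sketch; assumes the checkpoint is compatible with the generic seq2seq pipeline):

from transformers import pipeline

transliterator = pipeline("text2text-generation", model=model_name)
result = transliterator("en2ne: namaste", max_length=128, num_beams=4)
print(result[0]["generated_text"])  # Expected: नमस्ते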

Training Data

The model was trained on a custom dataset containing:

  • Size: [Update with dataset size, e.g., 50,000 transliteration pairs]
  • Sources:
    • Nepali names and places
    • Common vocabulary
    • Cultural terms
    • Government documents
    • Educational materials
  • Preprocessing: Text normalization, duplicate removal, quality filtering
  • Split: 80% training, 10% validation, 10% testing
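
A minimal sketch of how such a split can be reproduced with the datasets library (the file name and columns are illustrative; the actual dataset is not published with this card):

from datasets import load_dataset

# Hypothetical CSV of transliteration pairs with "english" and "nepali" columns.
ds = load_dataset("csv", data_files="transliteration_pairs.csv")["train"]

# 80% train, then split the remaining 20% evenly into validation and test.
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
dataset = {
    "train": split["train"],
    "validation": heldout["train"],
    "test": heldout["test"],
}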

Training Procedure

Training Hyperparameters

  • Batch Size: 64 (training), 16 (evaluation)
  • Learning Rate: [Update with actual value]
  • Epochs: 10
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Warmup Steps: 500
  • Max Sequence Length: 128
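
Expressed as Hugging Face Seq2SeqTrainingArguments, the configuration would look roughly like this (a sketch assuming the Trainer API was used; the learning rate is a placeholder, and max sequence length is applied at tokenization time):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nepali-transliteration",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,      # placeholder; actual value not published
    num_train_epochs=10,
    weight_decay=0.01,
    warmup_steps=500,
    predict_with_generate=True,
)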

Training Infrastructure

  • Hardware: [Update with your setup, e.g., Tesla V100, A100]
  • Framework: PyTorch, Transformers
  • Training Time: [Update with actual time]

Evaluation

Metrics

  • BLEU Score: 0.85 [update with actual]
  • Word Accuracy: 0.92 [update with actual]
  • Character Error Rate: 0.08 [update with actual]
  • Exact Match: 0.78 [update with actual]
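
For reference, character error rate here is the character-level Levenshtein (edit) distance divided by the reference length; a minimal implementation:

def cer(reference: str, hypothesis: str) -> float:
    # Character-level Levenshtein distance via a single-row DP table.
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(cur + 1,                                         # deletion
                        dp[j - 1] + 1,                                   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("kathmandu", "kathmandu"))  # 0.0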

Test Results

Direction    CER
EN → NE      0.13
NE → EN      0.10

Limitations and Bias

Known Limitations

  • Performance may vary with proper nouns not seen during training
  • Limited handling of mixed-script text
  • May struggle with very long compound words
  • Accuracy depends on text quality and standardization

Potential Biases

  • Training data may over-represent certain regions or dialects of Nepali
  • Model may have better performance on formal/literary Nepali vs. colloquial forms
  • Potential bias toward more common transliteration patterns

Ethical Considerations

  • This model supports language preservation and digital inclusion for Nepali speakers
  • Care should be taken when using the model for official documents or personal names
  • Users should verify outputs for critical applications
  • The model should not be used to misrepresent or appropriate Nepali culture

Citation

@misc{nepali-transliteration-2025,
  title={Nepali Transliteration Model},
  author={Nirajan Sah},
  year={2025},
  url={https://huggingface.co/nirajan1111/nepali-transliteration-model}
}

Model Card Contact

For questions or feedback about this model, please contact: [[email protected]]

Acknowledgments

  • Thanks to the Nepali language community for providing linguistic insights
  • [Add any other acknowledgments]
