Nepali Transliteration Model

Model Description

This model performs bidirectional transliteration between Nepali (Devanagari script) and English (Latin script). It can convert:

  • English text to Nepali Devanagari script
  • Nepali Devanagari text to English romanization

The model is fine-tuned for accurate transliteration of Nepali names, places, and common vocabulary.

Model Details

  • Model Type: Sequence-to-sequence text generation
  • Language(s): Nepali (ne), English (en)
  • License: Apache 2.0
  • Base Model: [Specify your base model, e.g., T5, mT5, etc.]
  • Training Data: Custom Nepali-English transliteration dataset
  • Training Steps: [Update with actual number]
  • Parameters: ~300M (F32, safetensors)

Intended Use

Primary Use Cases

  • Converting English names and words to Nepali Devanagari script
  • Romanizing Nepali text for international audiences
  • Supporting multilingual applications and keyboards
  • Academic research in computational linguistics
  • Cultural preservation and digital humanities projects

Out-of-Scope Use Cases

  • Machine translation (this model only handles transliteration, not translation)
  • Text generation beyond transliteration
  • Processing languages other than Nepali and English

How to Use

Installation

pip install transformers torch
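
If the base checkpoint is a T5/mT5 variant (see Model Details), the tokenizer may also need the SentencePiece backend:

pip install sentencepiece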

Basic Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "nirajan111/nepali-transliteration"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# English to Nepali
def transliterate_en_to_ne(text):
    inputs = tokenizer(f"en2ne: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Nepali to English
def transliterate_ne_to_en(text):
    inputs = tokenizer(f"ne2en: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(transliterate_en_to_ne("namaste"))  # Expected: नमस्ते
print(transliterate_ne_to_en("काठमाडौं"))  # Expected: kathmandu
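
Inference is faster on a GPU when one is available. A minimal sketch that moves the model and inputs to the detected device (the combined helper below is illustrative, not part of the model's published API):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def transliterate(text, direction="en2ne"):
    # direction is the task prefix: "en2ne" or "ne2en"
    inputs = tokenizer(f"{direction}: {text}", return_tensors="pt",
                       max_length=128, truncation=True).to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)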

Advanced Usage

# Batch processing
texts = ["namaste", "dhanyabad", "kathmandu"]
inputs = tokenizer([f"en2ne: {text}" for text in texts], 
                  return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
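
The same calls can also be wrapped in a text2text-generation pipeline (a sketch; assumes the checkpoint is compatible with the generic seq2seq pipeline):

from transformers import pipeline

transliterator = pipeline("text2text-generation", model=model_name)
result = transliterator("en2ne: namaste", max_length=128, num_beams=4)
print(result[0]["generated_text"])  # Expected: नमस्ते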

Training Data

The model was trained on a custom dataset containing:

  • Size: [Update with dataset size, e.g., 50,000 transliteration pairs]
  • Sources:
    • Nepali names and places
    • Common vocabulary
    • Cultural terms
    • Government documents
    • Educational materials
  • Preprocessing: Text normalization, duplicate removal, quality filtering
  • Split: 80% training, 10% validation, 10% testing
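
A minimal sketch of how such a split can be reproduced with the datasets library (the file name and columns are illustrative; the actual dataset is not published with this card):

from datasets import load_dataset

# Hypothetical CSV of transliteration pairs with "english" and "nepali" columns.
ds = load_dataset("csv", data_files="transliteration_pairs.csv")["train"]

# 80% train, then split the remaining 20% evenly into validation and test.
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
dataset = {
    "train": split["train"],
    "validation": heldout["train"],
    "test": heldout["test"],
}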

Training Procedure

Training Hyperparameters

  • Batch Size: 64 (training), 16 (evaluation)
  • Learning Rate: [Update with actual value]
  • Epochs: 10
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Warmup Steps: 500
  • Max Sequence Length: 128
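
Expressed as Hugging Face Seq2SeqTrainingArguments, the configuration would look roughly like this (a sketch assuming the Trainer API was used; the learning rate is a placeholder, and max sequence length is applied at tokenization time):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nepali-transliteration",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,      # placeholder; actual value not published
    num_train_epochs=10,
    weight_decay=0.01,
    warmup_steps=500,
    predict_with_generate=True,
)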

Training Infrastructure

  • Hardware: [Update with your setup, e.g., Tesla V100, A100]
  • Framework: PyTorch, Transformers
  • Training Time: [Update with actual time]

Evaluation

Metrics

  • BLEU Score: 0.85 [update with actual]
  • Word Accuracy: 0.92 [update with actual]
  • Character Error Rate: 0.08 [update with actual]
  • Exact Match: 0.78 [update with actual]
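
For reference, character error rate here is the character-level Levenshtein (edit) distance divided by the reference length; a minimal implementation:

def cer(reference: str, hypothesis: str) -> float:
    # Character-level Levenshtein distance via a single-row DP table.
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(cur + 1,                                         # deletion
                        dp[j - 1] + 1,                                   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("kathmandu", "kathmandu"))  # 0.0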

Test Results

Direction    CER
EN → NE      0.13
NE → EN      0.10

Limitations and Bias

Known Limitations

  • Performance may vary with proper nouns not seen during training
  • Limited handling of mixed-script text
  • May struggle with very long compound words
  • Accuracy depends on text quality and standardization

Potential Biases

  • Training data may over-represent certain regions or dialects of Nepali
  • Model may have better performance on formal/literary Nepali vs. colloquial forms
  • Potential bias toward more common transliteration patterns

Ethical Considerations

  • This model supports language preservation and digital inclusion for Nepali speakers
  • Care should be taken when using the model for official documents or personal names
  • Users should verify outputs for critical applications
  • The model should not be used to misrepresent or appropriate Nepali culture

Citation

@misc{nepali-transliteration-2025,
  title={Nepali Transliteration Model},
  author={Nirajan Sah},
  year={2025},
  url={https://huggingface.co/nirajan1111/nepali-transliteration-model}
}

Model Card Contact

For questions or feedback about this model, please contact: [[email protected]]

Acknowledgments

  • Thanks to the Nepali language community for providing linguistic insights
  • [Add any other acknowledgments]
