MLM-POS Model: Nepali-English Code-Mixed Masked Language Model

This model is a custom fine-tuned Masked Language Model (MLM) trained on code-mixed Nepali-English transliterated text. It predicts masked tokens in sentences that mix Romanized Nepali and English, supporting tasks such as:

  • Transliteration prediction
  • Spelling correction in Roman Nepali
  • Context-aware infilling for code-mixed data

Model Details

  • Base Model: xlm-roberta-base
  • Objective: Masked Language Modeling (MLM)
  • Training Data: Custom dataset including:
    • Transliterated Nepali sentences
    • Corresponding English equivalents (optional or paired via |||)
    • POS-tag supervision (used during pre-processing)
  • MLM Probability: Configurable; trained with up to 100% masking for experimental spelling reconstruction
  • Training Steps: 100,000 steps using Hugging Face Trainer
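The masking objective behind the details above can be illustrated with a minimal, standalone sketch of the standard BERT-style 80/10/10 corruption scheme (Hugging Face's DataCollatorForLanguageModeling implements equivalent logic; the token ids below are illustrative constants, not values read from this checkpoint):

```python
import random

MASK_ID = 250001     # illustrative <mask> token id
VOCAB_SIZE = 250002  # illustrative vocabulary size

def mask_tokens(input_ids, mlm_probability=0.15, seed=None):
    """Return (corrupted_ids, labels). Each position is selected with
    probability `mlm_probability`; of the selected positions, 80% become
    <mask>, 10% become a random token, and 10% stay unchanged.
    Unselected positions get label -100, which the MLM loss ignores."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            labels.append(tok)           # model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:
                corrupted.append(tok)    # kept as-is but still predicted
        else:
            labels.append(-100)          # ignored by the loss
            corrupted.append(tok)
    return corrupted, labels
```

With mlm_probability=1.0 every position is selected, which corresponds to the full-masking spelling-reconstruction setting mentioned above.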

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the fine-tuned checkpoint (tokenizer and model with the MLM head)
checkpoint_path = "/path/to/mlm-pos-model/checkpoint-100000"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForMaskedLM.from_pretrained(checkpoint_path)

mlm_pipeline = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Input follows the training format: transliterated text ||| English equivalent
result = mlm_pipeline("नेपाल इज गूड प्लेस टू बी। ||| Nepal is good place <mask> be.", top_k=8)
for pred in result:
    print(f"{pred['token_str']} ({pred['score']:.4f})")
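Spelling correction in Roman Nepali can be built on top of the fill-mask pipeline by masking one word at a time and comparing the model's top prediction against the original. The helper below is a hypothetical sketch (correct_word and the min_score threshold are illustrative, not part of the released model); it accepts any fill-mask callable, so it works with the mlm_pipeline created above.

```python
def correct_word(fill_mask, words, idx, mask_token="<mask>", min_score=0.3):
    """Mask the word at position `idx`, query the model, and return the
    top prediction if it is confident enough, else the original word."""
    masked = " ".join(mask_token if i == idx else w for i, w in enumerate(words))
    preds = fill_mask(masked)  # list of dicts with "token_str" and "score"
    best = preds[0]
    if best["score"] >= min_score and best["token_str"].strip() != words[idx]:
        return best["token_str"].strip()
    return words[idx]

# Example with the pipeline above (hypothetical Roman Nepali sentence):
# words = "ma ghar jancu".split()
# fixed = [correct_word(mlm_pipeline, words, i) for i in range(len(words))]
```

The confidence threshold keeps the model from overwriting words it is unsure about; tuning it trades correction coverage against false rewrites.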
Model Size

  • 278M parameters (F32 tensors, Safetensors format)