MLM-POS Model: Nepali-English Code-Mixed Masked Language Model

This model is a custom fine-tuned Masked Language Model (MLM) trained on code-mixed Nepali-English transliterated text. It predicts masked tokens in sentences that mix Romanized Nepali and English, supporting tasks such as:

  • Transliteration prediction
  • Spelling correction in Roman Nepali
  • Context-aware infilling for code-mixed data

Model Details

  • Base Model: xlm-roberta-base
  • Objective: Masked Language Modeling (MLM)
  • Training Data: Custom dataset including:
    • Transliterated Nepali sentences
    • Corresponding English equivalents (optional or paired via |||)
    • POS-tag supervision (used during pre-processing)
  • MLM Probability: Configurable; trained with up to 100% masking for experimental spelling reconstruction
  • Training Steps: 100,000 steps using Hugging Face Trainer
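The masking objective behind the details above can be illustrated with a minimal, standalone sketch of the standard BERT-style 80/10/10 corruption scheme (Hugging Face's DataCollatorForLanguageModeling implements equivalent logic; the token ids below are illustrative constants, not values read from this checkpoint):

```python
import random

MASK_ID = 250001     # illustrative <mask> token id
VOCAB_SIZE = 250002  # illustrative vocabulary size

def mask_tokens(input_ids, mlm_probability=0.15, seed=None):
    """Return (corrupted_ids, labels). Each position is selected with
    probability `mlm_probability`; of the selected positions, 80% become
    <mask>, 10% become a random token, and 10% stay unchanged.
    Unselected positions get label -100, which the MLM loss ignores."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            labels.append(tok)           # model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:
                corrupted.append(tok)    # kept as-is but still predicted
        else:
            labels.append(-100)          # ignored by the loss
            corrupted.append(tok)
    return corrupted, labels
```

With mlm_probability=1.0 every position is selected, which corresponds to the full-masking spelling-reconstruction setting mentioned above.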

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the fine-tuned checkpoint (tokenizer and model with the MLM head)
checkpoint_path = "/path/to/mlm-pos-model/checkpoint-100000"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForMaskedLM.from_pretrained(checkpoint_path)

mlm_pipeline = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Input follows the training format: transliterated text ||| English equivalent
result = mlm_pipeline("नेपाल इज गूड प्लेस टू बी। ||| Nepal is good place <mask> be.", top_k=8)
for pred in result:
    print(f"{pred['token_str']} ({pred['score']:.4f})")
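Spelling correction in Roman Nepali can be built on top of the fill-mask pipeline by masking one word at a time and comparing the model's top prediction against the original. The helper below is a hypothetical sketch (correct_word and the min_score threshold are illustrative, not part of the released model); it accepts any fill-mask callable, so it works with the mlm_pipeline created above.

```python
def correct_word(fill_mask, words, idx, mask_token="<mask>", min_score=0.3):
    """Mask the word at position `idx`, query the model, and return the
    top prediction if it is confident enough, else the original word."""
    masked = " ".join(mask_token if i == idx else w for i, w in enumerate(words))
    preds = fill_mask(masked)  # list of dicts with "token_str" and "score"
    best = preds[0]
    if best["score"] >= min_score and best["token_str"].strip() != words[idx]:
        return best["token_str"].strip()
    return words[idx]

# Example with the pipeline above (hypothetical Roman Nepali sentence):
# words = "ma ghar jancu".split()
# fixed = [correct_word(mlm_pipeline, words, i) for i in range(len(words))]
```

The confidence threshold keeps the model from overwriting words it is unsure about; tuning it trades correction coverage against false rewrites.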
Model Size

  • 278M parameters (F32 tensors, Safetensors format)