m2m100_rup_tokenizer_both

This repository hosts the shared tokenizer used for our Roman Urdu ↔ Urdu transliteration models.

It is based on M2M100Tokenizer and extended with custom language tokens:

  • __ur__ for Urdu
  • __roman-ur__ for Roman Urdu

These tokens are stored in added_tokens.json and are required for correct transliteration.
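As a quick sanity check, you can verify that both tokens resolve to dedicated ids rather than the unknown token. This is a minimal sketch; the repository id below is a placeholder for the actual Hub path of this tokenizer:

```python
from transformers import M2M100Tokenizer

# Placeholder repository id; substitute the actual Hub path of this tokenizer.
tokenizer = M2M100Tokenizer.from_pretrained("your-org/m2m100_rup_tokenizer_both")

# Both custom language tokens should map to their own ids, not the <unk> id.
for token in ["__ur__", "__roman-ur__"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    assert token_id != tokenizer.unk_token_id, f"{token} is missing from the vocabulary"
    print(token, "->", token_id)
```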


When preparing input for the models, prepend the correct language token (__roman-ur__ or __ur__) to the source text, as in the sketch below.
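For example, a Roman Urdu → Urdu run might look like the following. This is a hedged sketch: the repository and model ids are placeholders, and supplying the target-language token via forced_bos_token_id is an assumption about the model setup; the card itself only specifies prepending the source token to the text.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Placeholder ids; substitute the actual tokenizer and model paths.
tokenizer = M2M100Tokenizer.from_pretrained("your-org/m2m100_rup_tokenizer_both")
model = M2M100ForConditionalGeneration.from_pretrained("your-org/roman-ur-to-ur")

# Prepend the source-language token to the raw text, as this tokenizer requires.
text = "__roman-ur__ mujhe urdu pasand hai"
inputs = tokenizer(text, return_tensors="pt")

# Assumption: the target-language token is forced as the first generated token.
ur_id = tokenizer.convert_tokens_to_ids("__ur__")
outputs = model.generate(**inputs, forced_bos_token_id=ur_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```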

If you use this tokenizer, please cite:

```bibtex
@inproceedings{butt2025romanurdu,
  title     = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author    = {Umer Butt and Stalin Varanasi and Günter Neumann},
  booktitle = {LoResMT Workshop @ NAACL 2025},
  year      = {2025}
}
```
