# m2m100_rup_tokenizer_both
This repository hosts the shared tokenizer used for our Roman Urdu ↔ Urdu transliteration models. It is based on `M2M100Tokenizer` and extended with custom language tokens:
- `__ur__` for Urdu
- `__roman-ur__` for Roman Urdu
These tokens are stored in `added_tokens.json` and are required for correct transliteration.
When preparing input for the models, prepend the correct language token (`__roman-ur__` or `__ur__`) to the text.
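A minimal sketch of this preparation step, assuming the input text's language token matches its script; the helper name `prepare_input` is hypothetical and not part of the tokenizer's API:

```python
# Language tokens defined in added_tokens.json.
UR = "__ur__"
ROMAN_UR = "__roman-ur__"

def prepare_input(text: str, lang_token: str) -> str:
    """Prefix the text with its language token, as the models expect.

    lang_token must be one of UR or ROMAN_UR (hypothetical helper,
    shown only to illustrate the required input format).
    """
    if lang_token not in (UR, ROMAN_UR):
        raise ValueError(f"Unknown language token: {lang_token!r}")
    return f"{lang_token} {text}"

print(prepare_input("mujhe yeh kitab pasand hai", ROMAN_UR))
# The prepared string is then passed to the tokenizer as usual.
```

The resulting string (e.g. `__roman-ur__ mujhe yeh kitab pasand hai`) can then be tokenized with the `M2M100Tokenizer` loaded from this repository.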
```bibtex
@inproceedings{butt2025romanurdu,
  title     = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author    = {Umer Butt and Stalin Varanasi and G{\"u}nter Neumann},
  year      = {2025},
  booktitle = {LoResMT Workshop @ NAACL 2025}
}
```