---
license: apache-2.0
datasets:
- Mavkif/roman-urdu-msmarco-dataset
language:
- ur
base_model:
- unicamp-dl/mt5-base-mmarco-v2
pipeline_tag: question-answering
tags:
- mt5
- information-retrieval
- NLP
- urdu
- roman-urdu
---
# Roman Urdu mT5 msmarco: Fine-Tuned mT5 Model for Roman Urdu Information Retrieval
As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Roman Urdu.
We created this model by translating the MS MARCO dataset into Roman Urdu using the IndicTrans2 model.
To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Roman Urdu IR,
and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology, achieving state-of-the-art results for Roman Urdu IR.
## Model Details
### Model Description
- **Developed by:** Umer Butt
- **Model type:** mT5-based sequence-to-sequence reranker for information retrieval
- **Language(s) (NLP):** Roman Urdu (ur)
- **Framework:** Python / PyTorch
## Bias, Risks, and Limitations
Although this model currently achieves state-of-the-art results for Roman Urdu IR, it was fine-tuned from the mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations and biases of both the base model and the translation model therefore carry over to this model.
## Evaluation
The evaluation was done using the scripts in the pygaggle library, specifically:
- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`
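`ms_marco_eval.py` reports MRR@10, the official MS MARCO passage-ranking metric. As a rough illustration of what that script computes, here is a minimal sketch; the data structures below are simplified assumptions, not the script's actual interface:

```python
def mrr_at_10(qrels, run):
    """Mean Reciprocal Rank at cutoff 10.

    qrels: dict mapping query_id -> set of relevant passage_ids
    run:   dict mapping query_id -> list of passage_ids, best first
    """
    total = 0.0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

# Toy example: the relevant passage sits at rank 2 -> MRR@10 = 0.5
print(mrr_at_10({"q1": {"p9"}}, {"q1": ["p3", "p9", "p7"]}))
```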
### Model Architecture and Objective
```json
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.45.2"
}
```
For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.
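As an illustration, the model can be used for monoT5-style reranking with the Transformers library. The sketch below is not an official usage snippet: the repository id is a placeholder for this model's actual Hub path, and the "yes"/"no" relevance tokens follow the convention pygaggle uses for the unicamp-dl mMARCO mT5 checkpoints; verify both against this repository before relying on the scores.

```python
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

# Placeholder id: replace with this repository's actual Hub path.
model_name = "Mavkif/roman-urdu-mt5-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

def relevance_score(query: str, passage: str) -> float:
    # monoT5 input template; the score is the probability of "yes" vs "no"
    # for the first generated token (token choice assumed from pygaggle's
    # setup for the mMARCO mT5 checkpoints).
    inputs = tokenizer(
        f"Query: {query} Document: {passage} Relevant:",
        return_tensors="pt", truncation=True, max_length=512,
    )
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    yes_id = tokenizer.convert_tokens_to_ids("▁yes")
    no_id = tokenizer.convert_tokens_to_ids("▁no")
    probs = torch.softmax(logits[[no_id, yes_id]], dim=0)
    return probs[1].item()  # probability mass on "yes"

# Rerank candidate passages for a Roman Urdu query (toy example).
query = "Pakistan ka darul hukumat kya hai?"
passages = [
    "Islamabad Pakistan ka darul hukumat hai.",
    "Karachi aik bara shehr hai.",
]
ranked = sorted(passages, key=lambda p: relevance_score(query, p), reverse=True)
print(ranked)
```

Because scoring reads a single decoder step rather than running full generation, the decoding parameters above mainly matter if you adapt the model for generative use.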