---
license: apache-2.0
datasets:
- Mavkif/roman-urdu-msmarco-dataset
language:
- ur
base_model:
- unicamp-dl/mt5-base-mmarco-v2
pipeline_tag: question-answering
tags:
- mt5
- information-retrieval
- NLP
- urdu
- roman-urdu
---
|
# Roman Urdu mT5 msmarco: Fine-Tuned mT5 Model for Roman-Urdu Information Retrieval |
|
As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Roman Urdu.
We created the training data by translating the MS MARCO dataset into Roman Urdu using the IndicTrans2 model.
To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Roman-Urdu IR, and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology, achieving state-of-the-art results for Roman-Urdu IR.
|
## Model Details |
|
### Model Description |
|
- **Developed by:** Umer Butt |
|
- **Model type:** IR model for reranking |
|
- **Language(s) (NLP):** Roman Urdu (ur)
- **License:** apache-2.0
- **Finetuned from model:** unicamp-dl/mt5-base-mmarco-v2
- **Framework:** Python / PyTorch
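
As a monoT5-style reranker, the model scores a query-passage pair by generating a relevance token for the prompt `Query: ... Document: ... Relevant:`. Below is a minimal usage sketch, assuming the "yes"/"no" prediction tokens follow the mMARCO monoT5 convention; `model_id` is a placeholder to replace with this repository's actual id.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: replace with this model's actual Hugging Face repo id.
model_id = "<this-model-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

# Token ids for the relevance labels; assumed to follow the
# monoT5/mMARCO convention of generating "yes" or "no".
yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("no", add_special_tokens=False)[0]

def rerank_score(query: str, passage: str) -> float:
    """Return P("yes") for a query-passage pair as the relevance score."""
    prompt = f"Query: {query} Document: {passage} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
        )
    logits = out.scores[0][0]  # vocabulary logits for the generated token
    probs = torch.softmax(logits[[no_id, yes_id]], dim=0)
    return probs[1].item()

# Illustrative Roman-Urdu query and candidate passages.
query = "pakistan ka darul hukumat kya hai"
passages = [
    "Islamabad Pakistan ka darul hukumat hai.",
    "Cricket Pakistan ka sab se mashhoor khel hai.",
]
ranked = sorted(passages, key=lambda p: rerank_score(query, p), reverse=True)
print(ranked[0])
```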
|
## Bias, Risks, and Limitations |
|
Although this model performs well and is currently state-of-the-art for Roman-Urdu retrieval, it was fine-tuned from the mMARCO reranker on a translated dataset (created with the IndicTrans2 model), so the limitations of both the base model and the translation pipeline carry over. In particular, translation errors in the training data may propagate into ranking quality.
|
## Evaluation |
|
The evaluation was done using the scripts in the pygaggle library, specifically these files:

- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`
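
For reference, `ms_marco_eval.py` reports the official MS MARCO MRR@10 metric. The sketch below shows the core of that computation; the TSV layouts (qrels as `qid 0 pid rel`, run as `qid pid rank`) follow the MS MARCO conventions, and the file names are hypothetical.

```python
from collections import defaultdict

def load_qrels(path: str) -> dict:
    """Load relevant passage ids per query (qrels line: qid 0 pid rel)."""
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, pid, rel = line.split()
            if int(rel) > 0:
                qrels[qid].add(pid)
    return qrels

def load_run(path: str) -> dict:
    """Load ranked passage ids per query (run line: qid pid rank)."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, pid, rank = line.split()
            run[qid].append((int(rank), pid))
    return {qid: [pid for _, pid in sorted(pairs)] for qid, pairs in run.items()}

def mrr_at_10(qrels: dict, run: dict) -> float:
    """Mean reciprocal rank of the first relevant passage in the top 10."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

# Hypothetical file names for illustration.
print(mrr_at_10(load_qrels("qrels.dev.tsv"), load_run("run.dev.tsv")))
```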
|
### Model Architecture and Objective |
|
```json
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.45.2"
}
```
|
For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.
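
As an illustration, these parameters can be passed directly to `generate()`; the repo id below is a placeholder for this model's actual id.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "<this-model-repo-id>"  # placeholder: replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Query: behtareen biryani kahan milti hai Document: Karachi apni biryani ke liye mashhoor hai. Relevant:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=2,         # reranking needs only the single relevance token
    num_beams=1,          # greedy decoding is enough for scoring
    early_stopping=True,  # only takes effect when num_beams > 1
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```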
|
|