|
--- |
|
license: mit |
|
language: |
|
- luo |
|
- swa |
|
base_model: |
|
- facebook/nllb-200-distilled-600M |
|
pipeline_tag: translation |
|
datasets: |
|
- SalomonMetre13/luo_swa_arXiv_2501.11003 |
|
metrics: |
|
- bleu |
|
library_name: transformers |
|
--- |
|
# Model Card for nllb-luo-swa-mt-v1 |
|
|
|
## Model Overview |
|
|
|
**Model Name**: nllb-luo-swa-mt-v1 |
|
**Model Type**: Machine Translation, Luo (Dholuo) to Swahili
|
**Base Model**: [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
|
**Languages**: Luo (Dholuo), Swahili |
|
**Version**: 1.0 |
|
**License**: MIT
|
**Dataset**: [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003) |
|
|
|
This model is a fine-tuned version of `facebook/nllb-200-distilled-600M` for translation between Luo (Dholuo) and Swahili. It was trained on a parallel corpus derived from the Dholuo–Swahili corpus created by Mbogho et al. (2025) through community-driven data collection.
|
|
|
## Model Description |
|
|
|
The `nllb-luo-swa-mt-v1` model performs machine translation from Luo (Dholuo) to Swahili, designed to improve translation capabilities for these low-resource languages. It was fine-tuned using the parallel corpus from the paper **"Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo"** by Mbogho et al. (2025). This model is particularly valuable for promoting linguistic diversity and facilitating the development of Natural Language Processing (NLP) tools in African languages. |
|
|
|
### Key Features
|
- **Training Data**: Fine-tuned on the Dholuo–Swahili parallel text corpus from the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from the grassroots data collection effort by Mbogho et al. (2025). |
|
- **Performance**: Achieved a BLEU score of 21.56 on the evaluation set, showing strong performance in a low-resource setting. |
|
- **Qualitative Analysis**: Translations generated by this model are sometimes more fluent and accurate than the provided reference translations. |
|
|
|
## Intended Use |
|
|
|
This model can be used for machine translation applications between Luo (Dholuo) and Swahili. Potential use cases include: |
|
- **Educational tools**: Enabling educational content in both languages, aiding language learners and teachers. |
|
- **Public health and community development**: Translating health information, community messages, and official communications. |
|
- **Cultural preservation**: Supporting the preservation and growth of the Luo language in the digital age. |
|
|
|
## Model Evaluation |
|
|
|
The model was evaluated with BLEU, a standard automatic metric for machine translation quality, and achieved a score of 21.56 on the evaluation set, a strong result for a low-resource language pair. Qualitative analysis further suggests that, in some cases, the model's outputs are more fluent and accurate than the reference translations.
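For reference, here is a minimal sketch of how a corpus-level BLEU score can be computed with the `sacrebleu` package; the sentences below are placeholders, not the actual evaluation data:

```python
import sacrebleu

# Placeholder hypotheses (model outputs) and references; substitute the
# real evaluation set to reproduce the reported score of 21.56.
hypotheses = ["Mwizi alikamatwa jana usiku"]
references = [["Mwizi alishikwa jana usiku"]]  # one reference stream

# corpus_bleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```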
|
|
|
## Training Details |
|
|
|
- **Training Data**: The model was trained on the Dholuo–Swahili parallel corpus from [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from Mbogho et al.'s (2025) work. The corpus is publicly available for further use and improvement; a minimal loading sketch follows this list.
|
- **Model Architecture**: The model is fine-tuned from the `NLLB-200-distilled-600M` version of the NLLB model family, which is designed for multilingual translation tasks. |
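As an illustration, the following sketch loads the corpus and prepares NLLB-style inputs. The column names `"luo"` and `"swa"` are assumptions for illustration; check the dataset card for the actual schema:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003")

# NLLB tokenizers take FLORES-200 language codes for source and target
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="luo_Latn",
    tgt_lang="swh_Latn",
)

def preprocess(batch):
    # "luo" and "swa" are hypothetical column names; adjust them to the
    # dataset's real field names.
    return tokenizer(
        batch["luo"], text_target=batch["swa"],
        truncation=True, max_length=128,
    )

tokenized = dataset.map(preprocess, batched=True)
```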
|
|
|
## Limitations |
|
|
|
- **Low-Resource Context**: While the model performs well given the limited amount of data, its performance may still lag behind models trained on larger corpora for more widely spoken languages. |
|
- **Domain-Specific Use**: The model may require additional fine-tuning to perform optimally on domain-specific text such as medical, legal, or technical content. |
|
|
|
## Future Directions |
|
|
|
- **Expanding the Dataset**: The quality and coverage of the model could be improved by incorporating more diverse and larger datasets. |
|
- **Additional Language Pairs**: Further fine-tuning to support other language pairs involving Luo or Swahili could make the model more versatile.
|
- **Real-World Applications**: The model could be applied to real-world projects such as translating educational materials, public health information, or community communication platforms. |
|
|
|
## Acknowledgements |
|
|
|
This model was developed based on the Dholuo–Swahili parallel corpus created by Mbogho et al. (2025) as part of their work in building low-resource African language corpora. The corpus was made publicly available on platforms like Zenodo and Mozilla Common Voice.
|
|
|
## How to Use |
|
|
|
You can access the model via the Hugging Face Hub at: |
|
[https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1](https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1) |
|
|
|
To load the model using the Hugging Face `transformers` library, use the following code: |
|
|
|
```python
from transformers import pipeline

# NLLB models require explicit source and target language codes
# (FLORES-200 codes: "luo_Latn" for Dholuo, "swh_Latn" for Swahili).
translator = pipeline(
    "translation",
    model="SalomonMetre13/nllb-luo-swa-mt-v1",
    src_lang="luo_Latn",
    tgt_lang="swh_Latn",
)

translation = translator("Jajuok nomaki nyoro gotieno")
print(translation)
```
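The pipeline returns a list of dictionaries, e.g. `[{'translation_text': '...'}]`.

If you need finer control over generation, you can also load the model and tokenizer directly. This is a sketch assuming the fine-tune keeps NLLB's standard FLORES-200 language codes:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "SalomonMetre13/nllb-luo-swa-mt-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="luo_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Jajuok nomaki nyoro gotieno", return_tensors="pt")

# Force the decoder to start with the Swahili language token
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```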