---
license: mit
language:
- luo
- swa
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- SalomonMetre13/luo_swa_arXiv_2501.11003
metrics:
- bleu
library_name: transformers
---
# Model Card for nllb-luo-swa-mt-v1
## Model Overview
**Model Name**: nllb-luo-swa-mt-v1
**Model Type**: Machine Translation (Luo (Dholuo) to Swahili)
**Base Model**: NLLB-200-distilled-600M
**Languages**: Luo (Dholuo), Swahili
**Version**: 1.0
**License**: MIT
**Dataset**: [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003)
This model is a fine-tuned version of the `NLLB-200-distilled-600M` model for translation between Luo (Dholuo) and Swahili. It was trained on a parallel corpus derived from the Dholuo–Swahili corpus created by Mbogho et al. (2025), based on community-driven data collection efforts.
## Model Description
The `nllb-luo-swa-mt-v1` model translates from Luo (Dholuo) into Swahili and is designed to improve translation coverage for this low-resource language pair. It was fine-tuned using the parallel corpus from the paper **"Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo"** by Mbogho et al. (2025). This model is particularly valuable for promoting linguistic diversity and facilitating the development of Natural Language Processing (NLP) tools in African languages.
### Key Features:
- **Training Data**: Fine-tuned on the Dholuo–Swahili parallel text corpus from the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from the grassroots data collection effort by Mbogho et al. (2025).
- **Performance**: Achieved a BLEU score of 21.56 on the evaluation set, showing strong performance in a low-resource setting.
- **Qualitative Analysis**: Translations generated by this model are sometimes more fluent and accurate than the provided reference translations.
## Intended Use
This model can be used for machine translation applications from Luo (Dholuo) to Swahili. Potential use cases include:
- **Educational tools**: Enabling educational content in both languages, aiding language learners and teachers.
- **Public health and community development**: Translating health information, community messages, and official communications.
- **Cultural preservation**: Supporting the preservation and growth of the Luo language in the digital age.
## Model Evaluation
The model was evaluated using the BLEU score, which is commonly used to assess machine translation performance. A BLEU score of 21.56 was achieved, which is a strong result for a low-resource language pair. Qualitative analysis of the translations suggests that, in some cases, the model's outputs outperform the reference translations in terms of fluency and accuracy.
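For reference, BLEU can be computed with the `evaluate` library (sacrebleu backend). The sentence pair below is an illustrative placeholder, not an example from this model's evaluation set:
```python
# Minimal BLEU-scoring sketch using the `evaluate` library (sacrebleu).
# The hypothesis/reference pair is a hypothetical placeholder, not taken
# from this model's evaluation data.
import evaluate

bleu = evaluate.load("sacrebleu")
predictions = ["Mwizi alikamatwa jana usiku."]    # model outputs
references = [["Mwizi alishikwa jana usiku."]]    # one reference list per output
result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.2f}")
```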
## Training Details
- **Training Data**: The model was trained on the Dholuo–Swahili parallel corpus, based on the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003) derived from Mbogho et al.'s (2025) work. The corpus includes text translations and is publicly available for further use and improvement.
- **Model Architecture**: The model is fine-tuned from the `NLLB-200-distilled-600M` member of the NLLB model family, which is designed for multilingual translation tasks; a minimal fine-tuning sketch is given below.
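The outline below is a minimal fine-tuning sketch, not the exact configuration used for this model; the dataset column names (`"luo"`, `"swa"`) and all hyperparameters are assumptions:
```python
# Minimal fine-tuning sketch; NOT the exact recipe used for this model.
# The column names ("luo", "swa") and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(
    base, src_lang="luo_Latn", tgt_lang="swh_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

dataset = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003")

def preprocess(batch):
    # Tokenize Luo source text and Swahili targets in one pass.
    return tokenizer(batch["luo"], text_target=batch["swa"],
                     truncation=True, max_length=128)

tokenized = dataset.map(
    preprocess, batched=True,
    remove_columns=dataset["train"].column_names,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="nllb-luo-swa-mt",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```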
## Limitations
- **Low-Resource Context**: While the model performs well given the limited amount of data, its performance may still lag behind models trained on larger corpora for more widely spoken languages.
- **Domain-Specific Use**: The model may require additional fine-tuning to perform optimally on domain-specific text such as medical, legal, or technical content.
## Future Directions
- **Expanding the Dataset**: The quality and coverage of the model could be improved by incorporating more diverse and larger datasets.
- **Additional Language Pairs**: Further fine-tuning on other language pairs involving Luo or Swahili could make the model even more versatile.
- **Real-World Applications**: The model could be applied to real-world projects such as translating educational materials, public health information, or community communication platforms.
## Acknowledgements
This model was developed based on the Dholuo–Swahili parallel corpus created by Mbogho et al. (2025) as part of their work in building low-resource African language corpora. The corpus was made publicly available on platforms like Zenodo and Mozilla Common Voice.
## How to Use
You can access the model via the Hugging Face Hub at:
[https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1](https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1)
To load the model using the Hugging Face `transformers` library, use the following code:
```python
from transformers import pipeline

# NLLB models need explicit FLORES-200 language codes:
# Luo = "luo_Latn", Swahili = "swh_Latn".
translator = pipeline(
    "translation", model="SalomonMetre13/nllb-luo-swa-mt-v1",
    src_lang="luo_Latn", tgt_lang="swh_Latn",
)
translation = translator("Jajuok nomaki nyoro gotieno")
print(translation)  # [{'translation_text': '...'}]
```
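For finer control over decoding, the model can also be loaded explicitly. The following is a sketch following the usual NLLB conventions, forcing the Swahili language token at the start of generation:
```python
# Explicit-loading sketch following standard NLLB conventions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "SalomonMetre13/nllb-luo-swa-mt-v1"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="luo_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("Jajuok nomaki nyoro gotieno", return_tensors="pt")
# Force Swahili ("swh_Latn") as the first generated token.
output = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_new_tokens=64,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```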