---
license: mit
language:
- luo
- swa
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- SalomonMetre13/luo_swa_arXiv_2501.11003
metrics:
- bleu
library_name: transformers
---

# Model Card for nllb-luo-swa-mt-v1

## Model Overview

- **Model Name**: nllb-luo-swa-mt-v1
- **Model Type**: Machine Translation (Luo (Dholuo) to Swahili)
- **Base Model**: NLLB-200-distilled-600M
- **Languages**: Luo (Dholuo), Swahili
- **Version**: 1.0
- **License**: CC0 (Public Domain)
- **Dataset**: [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003)

This model is a fine-tuned version of `NLLB-200-distilled-600M` for translation between Luo (Dholuo) and Swahili. It was trained on a parallel corpus derived from the Dholuo–Swahili corpus created by Mbogho et al. (2025) through community-driven data collection efforts.

## Model Description

The `nllb-luo-swa-mt-v1` model performs machine translation from Luo (Dholuo) to Swahili, and is designed to improve translation capabilities for these low-resource languages. It was fine-tuned on the parallel corpus from the paper **"Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo"** by Mbogho et al. (2025). The model is particularly valuable for promoting linguistic diversity and for developing Natural Language Processing (NLP) tools for African languages.

### Key Features

- **Training Data**: Fine-tuned on the Dholuo–Swahili parallel text corpus from the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from the grassroots data collection effort by Mbogho et al. (2025).
- **Performance**: Achieved a BLEU score of 21.56 on the evaluation set, a strong result in a low-resource setting.
- **Qualitative Analysis**: Translations generated by this model are sometimes more fluent and accurate than the provided reference translations.

## Intended Use

This model can be used for machine translation between Luo (Dholuo) and Swahili. Potential use cases include:

- **Educational tools**: Enabling educational content in both languages, aiding language learners and teachers.
- **Public health and community development**: Translating health information, community messages, and official communications.
- **Cultural preservation**: Supporting the preservation and growth of the Luo language in the digital age.

## Model Evaluation

The model was evaluated with BLEU, a standard metric for machine translation quality. It achieved a BLEU score of 21.56, a strong result for a low-resource language pair. Qualitative analysis suggests that, in some cases, the model's outputs surpass the reference translations in fluency and accuracy.

## Training Details

- **Training Data**: The model was trained on the Dholuo–Swahili parallel corpus from the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from Mbogho et al.'s (2025) work. The corpus consists of text translations and is publicly available for further use and improvement.
- **Model Architecture**: The model is fine-tuned from the `NLLB-200-distilled-600M` member of the NLLB model family, which is designed for multilingual translation tasks.

## Limitations

- **Low-Resource Context**: While the model performs well given the limited amount of data, its performance may still lag behind models trained on larger corpora for more widely spoken languages.
- **Domain-Specific Use**: The model may require additional fine-tuning to perform optimally on domain-specific text such as medical, legal, or technical content.
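For readers unfamiliar with the metric, the sketch below illustrates how a BLEU-style score combines n-gram precision with a brevity penalty. It is a minimal, single-sentence toy implementation for illustration only; the score reported above was produced by a standard evaluation toolkit, not by this code.

```python
import math
from collections import Counter

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of 1..max_n-gram
    precisions, scaled by a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        # Crude smoothing so a zero match does not zero out the whole score.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty discourages overly short hypotheses.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(round(bleu("jajuok nomaki nyoro gotieno", "jajuok nomaki nyoro gotieno"), 2))
```

An identical hypothesis and reference score 1.0; shorter or divergent hypotheses score strictly lower.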
## Future Directions

- **Expanding the Dataset**: The quality and coverage of the model could be improved by incorporating larger and more diverse datasets.
- **Additional Language Pairs**: Further fine-tuning to support other language pairs involving Luo and Swahili could make the model even more versatile.
- **Real-World Applications**: The model could be applied to real-world projects such as translating educational materials, public health information, or community communication platforms.

## Acknowledgements

This model was developed using the Dholuo–Swahili parallel corpus created by Mbogho et al. (2025) as part of their work on building low-resource African language corpora. The corpus was made publicly available on platforms such as Zenodo and Mozilla Common Voice.

## How to Use

The model is available on the Hugging Face Hub at: [https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1](https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1)

To load the model with the Hugging Face `transformers` library:

```python
from transformers import pipeline

# NLLB models use FLORES-200 language codes: "luo_Latn" for Dholuo
# and "swh_Latn" for Swahili.
translator = pipeline(
    "translation",
    model="SalomonMetre13/nllb-luo-swa-mt-v1",
    src_lang="luo_Latn",
    tgt_lang="swh_Latn",
)

translation = translator("Jajuok nomaki nyoro gotieno")
print(translation)
```