Update README.md
README.md
CHANGED
@@ -1,59 +1,67 @@
---
library_name: transformers
license: cc-by-nc-4.0
base_model: SalomonMetre13/nllb-luo-swa-mt-v1
tags:
- generated_from_trainer
model-index:
- name: nllb-luo-swa-mt-v1
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# nllb-luo-swa-mt-v1

This model is a fine-tuned version of [SalomonMetre13/nllb-luo-swa-mt-v1](https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1).
It achieves the following results on the evaluation set:
- eval_loss: 0.1146
- eval_bleu: 19.64
- eval_runtime: 798.5876
- eval_samples_per_second: 3.665
- eval_steps_per_second: 0.917
- epoch: 0.4556
- step: 3000

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 200
- num_epochs: 10
- mixed_precision_training: Native AMP

### Framework versions

# Model Card for nllb-luo-swa-mt-v1

## Model Overview

- **Model Name**: nllb-luo-swa-mt-v1
- **Model Type**: Machine Translation (Luo (Dholuo) to Swahili)
- **Base Model**: NLLB-200-distilled-600M
- **Languages**: Luo (Dholuo), Swahili
- **Version**: 1.0
- **License**: CC0 (Public Domain)
- **Dataset**: [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003)

This model is a fine-tuned version of `NLLB-200-distilled-600M` for translation between Luo (Dholuo) and Swahili. It was trained on a parallel corpus derived from the Dholuo–Swahili corpus created by Mbogho et al. (2025) through community-driven data collection.

## Model Description

The `nllb-luo-swa-mt-v1` model translates from Luo (Dholuo) to Swahili and is designed to improve translation quality for these low-resource languages. It was fine-tuned on the parallel corpus from the paper **"Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo"** by Mbogho et al. (2025). The model is particularly valuable for promoting linguistic diversity and for building Natural Language Processing (NLP) tools in African languages.

### Key Features

- **Training Data**: Fine-tuned on the Dholuo–Swahili parallel text corpus from the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from the grassroots data collection effort by Mbogho et al. (2025).
- **Performance**: Achieves a BLEU score of 21.56 on the evaluation set, a strong result in a low-resource setting.
- **Qualitative Analysis**: Translations generated by this model are sometimes more fluent and accurate than the provided reference translations.

## Intended Use

This model can be used for machine translation between Luo (Dholuo) and Swahili. Potential use cases include:

- **Educational tools**: Making educational content available in both languages, aiding language learners and teachers.
- **Public health and community development**: Translating health information, community messages, and official communications.
- **Cultural preservation**: Supporting the preservation and growth of the Luo language in the digital age.

## Model Evaluation

The model was evaluated with BLEU, a standard metric for machine translation quality. It reached a BLEU score of 21.56, a strong result for a low-resource language pair. Qualitative analysis suggests that, in some cases, the model's outputs outperform the reference translations in fluency and accuracy.

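For readers who want to reproduce a BLEU measurement, the sketch below combines the translation pipeline with `sacrebleu`. The split name and column names (`test`, `luo`, `swa`) are assumptions made for illustration, not taken from the dataset card, and the exact configuration behind the reported 21.56 is not documented here.

```python
# Minimal BLEU evaluation sketch (split/column names "test", "luo", "swa" are assumptions).
from datasets import load_dataset
from transformers import pipeline
import sacrebleu

# Load the evaluation split of the parallel corpus (split name is an assumption).
dataset = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003", split="test")

translator = pipeline(
    "translation",
    model="SalomonMetre13/nllb-luo-swa-mt-v1",
    src_lang="luo_Latn",  # FLORES-200 code for Luo (Dholuo)
    tgt_lang="swh_Latn",  # FLORES-200 code for Swahili
)

sources = dataset["luo"]     # assumed source column
references = dataset["swa"]  # assumed reference column

# Translate the source side and collect the hypotheses.
hypotheses = [out["translation_text"] for out in translator(sources, batch_size=8)]

# Corpus-level BLEU with sacrebleu.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```
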
## Training Details

- **Training Data**: The model was trained on the Dholuo–Swahili parallel corpus from the dataset [SalomonMetre13/luo_swa_arXiv_2501.11003](https://huggingface.co/datasets/SalomonMetre13/luo_swa_arXiv_2501.11003), derived from Mbogho et al.'s (2025) work. The corpus consists of parallel text and is publicly available for further use and improvement.
- **Model Architecture**: The model is fine-tuned from the `NLLB-200-distilled-600M` checkpoint of the NLLB model family, which is designed for multilingual translation tasks. A rough fine-tuning sketch is given after this list.

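The following is a sketch of how such a fine-tuning run could be set up with the `transformers` Seq2SeqTrainer, not the exact recipe used for this checkpoint. The column and split names are assumptions, and the hyperparameter values are copied from the earlier auto-generated card (learning rate 3e-05, batch size 4, 200 warmup steps, 10 epochs, mixed precision) rather than verified against the released model.

```python
# Hypothetical fine-tuning sketch for NLLB-200-distilled-600M on the Luo-Swahili corpus.
# Column names ("luo", "swa") and split names are assumptions; adjust to the actual dataset schema.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="luo_Latn", tgt_lang="swh_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

dataset = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003")

def preprocess(batch):
    # Tokenize Luo sources and Swahili targets in one call.
    return tokenizer(batch["luo"], text_target=batch["swa"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-luo-swa-mt-v1",
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=200,
    num_train_epochs=10,
    fp16=True,  # native AMP, as listed in the earlier card
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```
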
## Limitations

- **Low-Resource Context**: While the model performs well given the limited amount of data, its performance may still lag behind models trained on larger corpora for more widely spoken languages.
- **Domain-Specific Use**: The model may require additional fine-tuning to perform well on domain-specific text such as medical, legal, or technical content.

## Future Directions

- **Expanding the Dataset**: Quality and coverage could be improved by incorporating larger and more diverse datasets.
- **Additional Language Pairs**: Further fine-tuning to support other language pairs involving Luo and Swahili could make the model more versatile.
- **Real-World Applications**: The model could be applied to real-world projects such as translating educational materials, public health information, or community communication platforms.

## Acknowledgements

This model was developed using the Dholuo–Swahili parallel corpus created by Mbogho et al. (2025) as part of their work on building low-resource African language corpora. The corpus was made publicly available on platforms such as Zenodo and Mozilla Common Voice.

## How to Use

You can access the model on the Hugging Face Hub at
[https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1](https://huggingface.co/SalomonMetre13/nllb-luo-swa-mt-v1).

To load the model with the Hugging Face `transformers` library, use the following code:

```python
from transformers import pipeline

# Translation pipeline built on the fine-tuned checkpoint.
# See the sketch below for passing explicit source/target language codes.
translator = pipeline("translation", model="SalomonMetre13/nllb-luo-swa-mt-v1")

translation = translator("Ninapenda kujua kuhusu lugha ya Dholuo.")
print(translation)
```
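Because NLLB checkpoints are multilingual, it is generally safer to pass the source and target language codes explicitly. The sketch below assumes the Luo-to-Swahili direction described in this card and uses the standard FLORES-200 codes `luo_Latn` and `swh_Latn`; the input string is only a placeholder.

```python
# Explicit-language-code sketch (assumed direction: Luo -> Swahili).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "SalomonMetre13/nllb-luo-swa-mt-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="luo_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Replace this with a Dholuo sentence."  # placeholder source text
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to generate Swahili output.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```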