---
license: cc-by-nc-4.0
language:
- en
- arz
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
library_name: transformers
datasets:
- IbrahimAmin/arz-en-parallel-corpus
---
## 🧠 Model Description
This model is a fine-tuned version of `facebook/nllb-200-distilled-600M`, specialized for English-to-Egyptian Arabic (arz) translation. It was trained to improve performance on informal, dialectal text, particularly spoken Egyptian Arabic.
The base model is part of the No Language Left Behind (NLLB) initiative.
## 💬 Intended Use
This model is intended for translating English text into Egyptian Arabic (arz), particularly:
- Informal speech
- Conversational and social media content
- Spoken dialogue datasets
It is not recommended for use in formal or Modern Standard Arabic (MSA) contexts, as the output will reflect dialectal structures and vocabulary.
## 🏋️ Training Details
- Base model: `facebook/nllb-200-distilled-600M`
- Target language pair: English → Egyptian Arabic (en → arz)
- Training dataset: `IbrahimAmin/arz-en-parallel-corpus`
  - Includes subtitle translations, synthetic translations, and conversational Egyptian Arabic text
  - Covers both informal and semi-formal domains
- Framework: 🤗 Transformers + PyTorch (a fine-tuning sketch based on these settings follows the loss table below)
- Training duration: 10 epochs
- Batch size: 12
- Learning rate: 2e-5
- Encoder: frozen
- Precision: bf16
| Epoch | Training Loss | Validation Loss |
|------:|--------------:|----------------:|
| 0 | No log | 12.742368 |
| 1 | 6.736500 | 6.469766 |
| 2 | 6.328200 | 6.097203 |
| 3 | 6.004600 | 5.790025 |
| 4 | 5.745800 | 5.544414 |
| 5 | 5.537400 | 5.364527 |
| 6 | 5.339400 | 5.211165 |
| 7 | 5.224800 | 5.101339 |
| 8 | 5.131800 | 5.019337 |
| 9 | 5.076800 | 4.990577 |
| 10 | 5.059100 | 4.964704 |
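The hyperparameters and frozen encoder listed above roughly map onto the standard 🤗 `Seq2SeqTrainer` setup. The sketch below is a hedged illustration, not the exact training script: the split names (`train`/`validation`), column names (`en`/`arz`), `max_length`, and `output_dir` are assumptions and should be adjusted to the dataset's real schema.

```python
# Hedged fine-tuning sketch (not the exact training script).
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="eng_Latn", tgt_lang="arz_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Freeze the encoder, as noted in the training details above.
# (NLLB ties input embeddings, so the shared embedding matrix is frozen too.)
for param in model.get_encoder().parameters():
    param.requires_grad = False

raw = load_dataset("IbrahimAmin/arz-en-parallel-corpus")

def preprocess(batch):
    # Assumed column names; text_target tokenizes labels with the arz_Arab code.
    return tokenizer(batch["en"], text_target=batch["arz"], truncation=True, max_length=256)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-600M-en-to-arz",  # illustrative
    num_train_epochs=10,
    per_device_train_batch_size=12,
    learning_rate=2e-5,
    bf16=True,
    save_strategy="epoch",
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],  # assumes a validation split exists
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```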
## 🧪 Evaluation
We evaluated the model using BLEU on a held-out test set of English–Egyptian Arabic pairs from `IbrahimAmin/arz-en-parallel-corpus`; a sketch of the scoring setup follows the list below.
Manual inspection suggests improved handling of:
- Idiomatic expressions
- Spoken-style phrasing
- Common Egyptian dialect vocabulary
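For reference, BLEU can be computed with the `evaluate` wrapper around sacreBLEU; the snippet below is only an illustration of the scoring setup (the specific scorer and the placeholder strings are assumptions, not the actual test-set results).

```python
# Illustrative BLEU computation; predictions/references are placeholders.
import evaluate

sacrebleu = evaluate.load("sacrebleu")
predictions = ["إزيك النهاردة؟"]  # model outputs, one string per example
references = [["إزيك عامل إيه النهاردة؟"]]  # one or more references per example
score = sacrebleu.compute(predictions=predictions, references=references)["score"]
print(f"BLEU: {score:.2f}")
```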
## 📝 Example
Input (en): How are you doing today?

Output (arz): إزيك النهاردة؟
## ⚠️ Limitations
- May hallucinate or formalize certain expressions depending on the context.
- Trained primarily on synthetic and semi-formal sources; may not generalize well to highly domain-specific jargon.
- Not suitable for translating into MSA or other Arabic dialects.
## 🚀 Usage
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# float16 is only reliable on GPU; fall back to float32 when running on CPU
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSeq2SeqLM.from_pretrained("IbrahimAmin/nllb-200-distilled-600M-en-to-arz", torch_dtype=dtype).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained("IbrahimAmin/nllb-200-distilled-600M-en-to-arz", src_lang="eng_Latn", tgt_lang="arz_Arab")

article = "How are you doing today?"
inputs = tokenizer(article, return_tensors="pt").to(device)

# Force generation to start with the Egyptian Arabic (arz_Arab) language token
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"))
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
# Output: 'إزيك النهاردة؟'
```
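For quick experiments, the same checkpoint should also work through the high-level `pipeline` API (an untested sketch, shown as a convenience rather than the documented usage):

```python
from transformers import pipeline

# Translation pipeline with NLLB language codes
translator = pipeline(
    "translation",
    model="IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    src_lang="eng_Latn",
    tgt_lang="arz_Arab",
)
print(translator("How are you doing today?")[0]["translation_text"])
```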
## ✨ License
This model is distributed under the same license as the original NLLB-200 model (CC-BY-NC 4.0).
See LICENSE for details.
## 📚 Citation
If you use this model, please cite:

```bibtex
@misc{ibrahimamin2025nllb200arz,
  title        = {NLLB-200-600M English to Egyptian Arabic},
  author       = {Ibrahim Amin},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/nllb-200-distilled-600M-en-to-arz}},
}
```