---
base_model:
- UBC-NLP/AraT5v2-base-1024
language:
- ar
library_name: transformers
license: apache-2.0
metrics:
- bleu
pipeline_tag: translation
tags:
- Syrian
- Shami
- MT
- MSA
- Dialect
- ArabicNLP
---
# SHAMI-MT: A Machine Translation Model from MSA to Syrian Dialect
This model is based on the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268).

## Model Description
SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic.
## Model Details
- **Model Type**: Sequence-to-Sequence Translation
- **Base Model**: UBC-NLP/AraT5v2-base-1024
- **Language**: Arabic (MSA → Syrian Dialect)
- **License**: Apache 2.0
- **Library**: Transformers
## Dataset
The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.

### Nâbra Dataset Details
**Citation:**
```
Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023).
Nâbra: Syrian Arabic dialects with morphological annotations.
arXiv preprint arXiv:2310.17315.
```
**Key Statistics:**
- **Size**: ~60,000 words
- **Dialects Covered**: Multiple Syrian regional dialects, including:
  - Aleppo
  - Damascus
  - Deir-ezzur
  - Hama
  - Homs
  - Huran
  - Latakia
  - Mardin
  - Raqqah
  - Suwayda

**Data Sources:**
- Social media posts
- Movie and TV series scripts
- Song lyrics
- Local proverbs
## Training Details
The model was fine-tuned on the AraT5v2-base-1024 architecture with the following training metrics:
- **Total Training Steps**: 10,384
- **Epochs**: 22
- **Final Training Loss**: 1.396
- **Final Evaluation Loss**: 0.771
- **Learning Rate**: Cosine schedule starting at 5e-5
- **Batch Size**: 256
- **Total FLOPs**: 1.58e+17
### Training Progress
The model showed consistent improvement throughout training:
- Initial loss: 12.93 → Final loss: 1.40
- Evaluation loss steadily decreased from 1.44 to 0.77
- Gradient norms remained stable throughout training
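The cosine learning-rate schedule above can be illustrated with a short sketch (an assumption-laden illustration: it presumes a standard cosine decay from 5e-5 to zero over the 10,384 steps with no warmup, details the card does not specify):

```python
import math

def cosine_lr(step, total_steps=10_384, base_lr=5e-5, min_lr=0.0):
    """Cosine learning-rate decay from base_lr toward min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Peak at step 0, half the peak at the midpoint, ~0 at the final step.
print(cosine_lr(0), cosine_lr(5_192), cosine_lr(10_384))
```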
## Usage
### Installation
```bash
pip install transformers torch sentencepiece
```
### Inference Code
```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")

# Example usage
ar_prompt = "مرحبا بك هنا"  # MSA input
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Input (MSA):", ar_prompt)
print("Tokenized input:", tokenizer.tokenize(ar_prompt))
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Generation Parameters
For optimal results, you can adjust generation parameters:
```python
outputs = model.generate(
    input_ids,
    max_length=128,
    num_beams=4,
    temperature=0.7,  # only takes effect because do_sample=True
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
```
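For repeated use, the tokenize-generate-decode steps above can be wrapped in a small helper. This is a sketch, not part of the released API; the function name and default parameters are illustrative:

```python
def translate_msa_to_shami(model, tokenizer, text, **generate_kwargs):
    """Translate a single MSA sentence to Syrian dialect.

    `model` and `tokenizer` are the objects loaded in the snippets above;
    extra keyword arguments are forwarded to `model.generate`.
    """
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generate_kwargs.setdefault("max_length", 128)
    generate_kwargs.setdefault("num_beams", 4)
    outputs = model.generate(input_ids, **generate_kwargs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```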
## Evaluation Results
- **Test Set**: 1,500 unseen sentences
- **Evaluation Method**: GPT-4.1 as an automated judge
- **Average Score**: **4.01/5.0** ⭐
- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation

The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt:
```
"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
Please return a rating from 0 to 5 and a short comment.
MSA Input: [input sentence]
Model Prediction (Shami dialect): [model output]
Ground Truth (Shami dialect): [reference translation]
Respond in this format:
Score: <number from 0 to 5>
Comment: <brief explanation of the score>"
```
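When running such an evaluation at scale, the judge's free-text replies must be parsed back into numbers. A minimal parser for the `Score:`/`Comment:` format above might look like this (illustrative only; the actual evaluation harness is not published here):

```python
import re

def parse_judge_reply(reply):
    """Extract (score, comment) from a 'Score: N / Comment: ...' judge reply.

    Returns (None, raw_text) when no score line is found.
    """
    score_match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
    comment_match = re.search(r"Comment:\s*(.+)", reply, re.DOTALL)
    if score_match is None:
        return None, reply.strip()
    comment = comment_match.group(1).strip() if comment_match else ""
    return float(score_match.group(1)), comment
```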
**Score Distribution Analysis:**
- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion
- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences
- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies
- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning
- **Poor (0-1.9)**: Significant translation errors or loss of meaning
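The banding above is straightforward to apply to a list of judge scores; a small sketch (function names are illustrative, not from the paper):

```python
from collections import Counter

def score_band(score):
    """Map a 0-5 judge score to the bands listed above."""
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"

def band_distribution(scores):
    """Count how many scores fall into each band."""
    return Counter(score_band(s) for s in scores)
```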
### Performance Highlights
- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect
- **Semantic Preservation**: Maintains original meaning while adapting linguistic style
- **Regional Adaptability**: Handles various Syrian sub-dialects effectively
- **Consistent Quality**: Stable performance across different text types and domains
## Applications
This model is particularly useful for:
- **Content Localization**: Adapting MSA content for Syrian audiences
- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations
- **Educational Tools**: Teaching differences between MSA and Syrian dialect
- **Research**: Syrian Arabic NLP and dialectology studies
## Regional Coverage
The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:
🏛️ **Urban Centers**: Damascus, Aleppo
🏔️ **Northern Regions**: Latakia, Mardin
🏜️ **Eastern Areas**: Deir-ezzur, Raqqah
🌄 **Central/Southern**: Hama, Homs, Huran, Suwayda
## Limitations
- Trained specifically on Syrian dialect variations
- Performance may vary for other Arabic dialects
- Limited to text-based translation (no speech support)
- Dataset size constraints may affect handling of very rare dialectal expressions
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{shami-mt-2024,
  title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
  author={Omartificial Intelligence Space},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
}

@article{nayouf2023nabra,
  title={Nâbra: Syrian Arabic dialects with morphological annotations},
  author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
  journal={arXiv preprint arXiv:2310.17315},
  year={2023}
}

@misc{onajar2025shamiMT,
  title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
  author={Sibaee, Serry and Nacar, Omer},
  year={2025}
}
```
## Contact & Support
For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.