---
base_model:
- UBC-NLP/AraT5v2-base-1024
language:
- ar
library_name: transformers
license: apache-2.0
metrics:
- bleu
pipeline_tag: translation
tags:
- Syrian
- Shami
- MT
- MSA
- Dialect
- ArabicNLP
---

# SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect

This model is based on the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/eyHzopOleQcVFz9LkO6Nv.png)

## Model Description

SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic.

## Model Details

- **Model Type**: Sequence-to-Sequence Translation
- **Base Model**: UBC-NLP/AraT5v2-base-1024
- **Language**: Arabic (MSA → Syrian Dialect)
- **License**: Apache 2.0
- **Library**: Transformers

## Dataset

The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/AaN6gPticioHBTXdPsroy.png)

### Nâbra Dataset Details

**Citation:**
```
Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023). 
Nâbra: Syrian Arabic dialects with morphological annotations. 
arXiv preprint arXiv:2310.17315.
```

**Key Statistics:**
- **Size**: ~60,000 words
- **Dialects Covered**: Multiple Syrian regional dialects including:
  - Aleppo
  - Damascus
  - Deir-ezzur
  - Hama
  - Homs
  - Huran
  - Latakia
  - Mardin
  - Raqqah
  - Suwayda

**Data Sources:**
- Social media posts
- Movie and TV series scripts
- Song lyrics
- Local proverbs

## Training Details

The model was fine-tuned from the AraT5v2-base-1024 checkpoint with the following training configuration and metrics:

- **Total Training Steps**: 10,384
- **Epochs**: 22
- **Final Training Loss**: 1.396
- **Final Evaluation Loss**: 0.771
- **Learning Rate**: Cosine schedule starting at 5e-5
- **Batch Size**: 256
- **Total FLOPs**: 1.58e+17

### Training Progress

The model showed consistent improvement throughout training:
- Initial loss: 12.93 → Final loss: 1.40
- Evaluation loss steadily decreased from 1.44 to 0.77
- Gradient norms remained stable throughout training
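
For reference, here is a minimal fine-tuning sketch consistent with the configuration above (cosine schedule from a 5e-5 peak, 22 epochs, effective batch size 256). The dataset column names (`msa`, `shami`), the `train_ds`/`eval_ds` splits, and the 32×8 batch/accumulation split are illustrative assumptions, not the authors' exact training script.

```python
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

base = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

def preprocess(batch):
    # Hypothetical column names; the actual parallel-corpus schema may differ.
    inputs = tokenizer(batch["msa"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["shami"], max_length=1024, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

args = Seq2SeqTrainingArguments(
    output_dir="shami-mt",
    num_train_epochs=22,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",       # cosine decay from the 5e-5 peak
    per_device_train_batch_size=32,   # 32 x 8 accumulation steps = effective batch of 256 (assumed split)
    gradient_accumulation_steps=8,
)

# `train_ds` / `eval_ds` stand in for tokenized MSA -> Shami splits mapped through `preprocess`.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```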

## Usage

### Installation

```bash
pip install transformers torch sentencepiece
```

### Inference Code

```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")

# Example usage
ar_prompt = "مرحبا بك هنا"  # MSA input
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Input (MSA):", ar_prompt)
print("Tokenized input:", tokenizer.tokenize(ar_prompt))
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
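
With no explicit arguments, `generate()` falls back to the model's default generation config, which typically means greedy decoding and a short maximum output length that can truncate longer sentences; the parameters in the next section address both.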

### Generation Parameters

For optimal results, you can adjust generation parameters:

```python
outputs = model.generate(
    input_ids,
    max_length=128,                       # cap output length; raise for long passages
    num_beams=4,                          # keep 4 candidate beams
    temperature=0.7,                      # soften the sampling distribution
    do_sample=True,                       # sample within beams rather than taking the argmax
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
```
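
Combining `num_beams` with `do_sample=True` gives stochastic beam-multinomial sampling; for deterministic output, drop `do_sample` and `temperature` and keep plain beam search.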

### Evaluation Results

- **Test Set**: 1,500 unseen sentences
- **Evaluation Method**: GPT-4.1 as automated judge
- **Average Score**: **4.01/5.0**
- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation

The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt:

```
"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
Please return a rating from 0 to 5 and a short comment.

MSA Input: [input sentence]
Model Prediction (Shami dialect): [model output]
Ground Truth (Shami dialect): [reference translation]

Respond in this format:
Score: <number from 0 to 5>
Comment: <brief explanation of the score>"
```
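
A minimal sketch of how such a judging loop could be scripted with the OpenAI Python client is shown below; the harness details and the `Score:` parsing are assumptions for illustration, not the exact evaluation code.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
Please return a rating from 0 to 5 and a short comment.

MSA Input: {msa}
Model Prediction (Shami dialect): {pred}
Ground Truth (Shami dialect): {ref}

Respond in this format:
Score: <number from 0 to 5>
Comment: <brief explanation of the score>"""

def judge(msa: str, pred: str, ref: str) -> float:
    """Score one translation with the judge model; returns the parsed 0-5 score."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(msa=msa, pred=pred, ref=ref)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", text)
    return float(match.group(1)) if match else float("nan")
```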

**Score Distribution Analysis:**
- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion
- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences
- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies
- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning
- **Poor (0-1.9)**: Significant translation errors or loss of meaning

### Performance Highlights
- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect
- **Semantic Preservation**: Maintains original meaning while adapting linguistic style
- **Regional Adaptability**: Handles various Syrian sub-dialects effectively
- **Consistent Quality**: Stable performance across different text types and domains
  
## Applications

This model is particularly useful for:
- **Content Localization**: Adapting MSA content for Syrian audiences
- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations
- **Educational Tools**: Teaching differences between MSA and Syrian dialect
- **Research**: Syrian Arabic NLP and dialectology studies

## Regional Coverage

The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:

🏛️ **Urban Centers**: Damascus, Aleppo  
🏔️ **Northern Regions**: Latakia, Mardin  
🏜️ **Eastern Areas**: Deir-ezzur, Raqqah  
🌄 **Central/Southern**: Hama, Homs, Huran, Suwayda

## Limitations

- Trained specifically on Syrian dialect variations
- Performance may vary for other Arabic dialects
- Limited to text-based translation (no speech support)
- Dataset size constraints may affect handling of very rare dialectal expressions

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{shami-mt-2024,
  title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
  author={Omartificial Intelligence Space},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
}

@article{nayouf2023nabra,
  title={Nâbra: Syrian Arabic dialects with morphological annotations},
  author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
  journal={arXiv preprint arXiv:2310.17315},
  year={2023}
}

@misc{onajar2025shamiMT,
  title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
  author={Sibaee, Serry and Nacar, Omer},
  year={2025}
}
```

## Contact & Support

For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.