SHAMI-MT-2MSA: A Machine Translation Model from Syrian Dialect to MSA

This model is part of the work presented in the paper SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System.

Paper Abstract

The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.


Model Description

SHAMI-MT-2MSA is one of the two specialized models that make up the SHAMI-MT bidirectional machine translation system. This model handles the Shami-to-MSA direction: it takes Syrian dialect text as input and produces Modern Standard Arabic (MSA). It is built on the AraT5v2-base-1024 architecture.

Usage

This model can be used directly with the Hugging Face transformers library.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load tokenizer and model
model_id = "Omartificial-Intelligence-Space/Shami-MT-2MSA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example input: Syrian Arabic dialect
input_text = "كيفك اليوم؟"  # "How are you today?" in Syrian dialect
inputs = tokenizer(input_text, return_tensors="pt")

# Generate translation
outputs = model.generate(**inputs, max_new_tokens=128)  # max_new_tokens caps the length of the generated output
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Syrian Dialect: {input_text}")
print(f"Modern Standard Arabic: {translated_text}")

Citation

If you use this model in your research, please cite the main paper and the dataset paper:

@article{sibaee2025shamimt,
  title={SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System},
  author={Sibaee, Serry and Nacar, Omer},
  year={2025},
  journal={Hugging Face Papers},
  url={https://huggingface.co/papers/2508.02268}
}

@article{nayouf2023nabra,
  title={Nâbra: Syrian Arabic dialects with morphological annotations},
  author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
  journal={arXiv preprint arXiv:2310.17315},
  year={2023}
}

Contact & Support

For questions, issues, or contributions, please visit the model repository or contact the development team.

Model size: 368M parameters (Safetensors, tensor type F32)