
Hugging Face 🤗 Space
arXiv paper
We present Tamyïz, an accurate and robust Transformer-based model for Arabic Dialect Identification (ADI) in speech. We adapt the pre-trained Massively Multilingual Speech (MMS) model and fine-tune it on diverse Arabic TV broadcast speech to identify the following Arabic language varieties:
- Modern Standard Arabic (MSA)
- Egyptian Arabic (Masri and Sudani)
- Gulf Arabic (Khleeji, Iraqi, and Yemeni)
- Levantine Arabic (Shami)
- Maghrebi Arabic (Dialects of al-Maghreb al-Arabi in North Africa)
Model Use Cases
The model can be used as a component in a large-scale speech data collection pipeline to create resources for different Arabic dialects. It can also be used to select Modern Standard Arabic (MSA) speech data for text-to-speech (TTS) systems.
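As an illustration of the filtering use case, here is a minimal sketch of such a pipeline step. The function name `filter_msa`, the 0.9 threshold, and the directory layout are our own illustrative choices, not part of the model card:

```python
def filter_msa(paths, classifier, threshold=0.9):
    """Keep only clips whose top predicted dialect is MSA with high confidence.

    paths: iterable of audio file paths.
    classifier: a callable returning a list of {"label", "score"} dicts
    sorted by descending score, e.g. the audio-classification pipeline.
    """
    kept = []
    for path in paths:
        top = classifier(str(path))[0]  # highest-scoring prediction
        if top["label"] == "MSA" and top["score"] >= threshold:
            kept.append(path)
    return kept

# In practice, `classifier` would be the pipeline from the usage example:
# from transformers import pipeline
# classifier = pipeline("audio-classification",
#                       model="badrex/mms-300m-arabic-dialect-identifier")
```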
In the Hugging Face 🤗 Transformers library
Consider this speech segment as an example.
Now we can use the model to identify the speaker's dialect as follows:
```python
from transformers import pipeline

# Load the model
model_id = "badrex/mms-300m-arabic-dialect-identifier"
adi5_classifier = pipeline(
    "audio-classification",
    model=model_id,
    device="cpu",  # or device="cuda" if you are connected to a GPU
)

# Predict the dialect of an audio sample
# note: use a resolve/ URL (not blob/) so the raw audio file is downloaded
audio_path = "https://huggingface.co/badrex/mms-300m-arabic-dialect-identifier/resolve/main/examples/Da7ee7.mp3"
predictions = adi5_classifier(audio_path)

for pred in predictions:
    print(f"Dialect: {pred['label']:<10} Confidence: {pred['score']:.4f}")
```
For this example, you will get the following output:

```
Dialect: Egyptian   Confidence: 0.9926
Dialect: MSA        Confidence: 0.0040
Dialect: Levantine  Confidence: 0.0033
Dialect: Maghrebi   Confidence: 0.0001
Dialect: Gulf       Confidence: 0.0000
```
Here, the model predicts the dialect correctly 🥳
The model was trained to handle variation in recording environments and should perform reasonably well on noisy speech segments. Consider this noisy speech segment from an old theatre recording:
Using the model to make the prediction as above, we get the following output:

```
Dialect: MSA        Confidence: 0.9636
Dialect: Levantine  Confidence: 0.0319
Dialect: Egyptian   Confidence: 0.0023
Dialect: Gulf       Confidence: 0.0019
Dialect: Maghrebi   Confidence: 0.0003
```
Once again, the model makes the correct prediction.
⚠️ Caution: Make sure your audio is sampled at 16 kHz. If it is not, resample it first, for example with librosa or torchaudio.
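For instance, `librosa.load(path, sr=16000)` resamples on load. As a dependency-light alternative (a sketch of our own, not from the model card), SciPy's polyphase resampler does the same job on a raw waveform array:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a 1-D waveform to 16 kHz using polyphase filtering."""
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# One second of 44.1 kHz audio becomes 16000 samples
resampled = to_16k(np.zeros(44100, dtype=np.float32), 44100)
```

The pipeline also accepts raw arrays, so you can pass the resampled audio as `adi5_classifier({"raw": resampled, "sampling_rate": 16000})`.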
Info ℹ️
- Developed by: Badr M. Abdullah and Matthew Baas
- Model type: wav2vec 2.0 architecture
- Language: Arabic (and its varieties)
- License: Creative Commons Attribution 4.0 (CC BY 4.0)
- Finetuned from model: [MMS-300m](https://huggingface.co/facebook/mms-300m)
Training Data
Trained on the MGB-3 ADI-5 dataset, which consists of TV broadcast speech from Al Jazeera (news, interviews, discussions, TV shows, etc.).
Evaluation
The model has been evaluated on the challenging multi-domain MADIS-5 benchmark. It performed very well in our evaluation and is expected to be robust to real-world speech samples.
Out-of-Scope Use
The model should not be used to:
- Assess fluency or nativeness of speech
- Determine whether the speaker uses a formal or informal register
- Make judgments about a speaker's origin, education level, or socioeconomic status
- Filter or discriminate against speakers based on dialect
Bias, Risks, and Limitations ⚠️
Some Arabic varieties are not well-represented in the training data. The model may not work well for some dialects such as Yemeni Arabic, Iraqi Arabic, and Saharan Arabic.
Additional limitations include:
- Very short audio samples (< 2 seconds) may not provide enough information for accurate classification
- Code-switching between dialects (especially mixing with MSA) may result in less reliable classifications
- Speakers who have lived in multiple dialect regions may exhibit mixed features
- Speech from non-typical speakers such as children and people with speech disorders might be challenging for the model
Recommendations
- For optimal results, use audio segments of at least 5-10 seconds
- Confidence scores may not always be informative (e.g., the model can make a wrong prediction with high confidence)
- For critical applications, consider human verification of model predictions
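The last two recommendations can be combined into a simple gate that flags predictions for human review. The helper name `needs_review` and the thresholds below are illustrative choices of ours, not calibrated values from the paper:

```python
def needs_review(predictions, duration_sec, min_duration=5.0, threshold=0.8):
    """Flag a prediction for human verification.

    predictions: list of {"label", "score"} dicts as returned by the
    pipeline, sorted by descending score.
    duration_sec: length of the audio segment in seconds.
    """
    too_short = duration_sec < min_duration       # below recommended length
    low_confidence = predictions[0]["score"] < threshold
    return too_short or low_confidence

# A confident prediction on a long clip passes; a short clip is flagged
preds = [{"label": "Egyptian", "score": 0.99}, {"label": "MSA", "score": 0.01}]
print(needs_review(preds, duration_sec=8.0))   # False
print(needs_review(preds, duration_sec=1.5))   # True
```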
Citation
If you use this model in your research, please cite our paper:
BibTeX:

```bibtex
@inproceedings{abdullah2025voice,
  title={Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification},
  author={Badr M. Abdullah and Matthew Baas and Bernd M{\"o}bius and Dietrich Klakow},
  booktitle={Interspeech},
  year={2025},
  url={https://arxiv.org/pdf/2505.24713}
}
```
Model Card Contact 📧
If you have any questions, please do not hesitate to write an email to badr dot nlp at gmail dot com.