
Hugging Face 🤗 Space
arXiv paper
We present Tamyïz, an accurate and robust Transformer-based model for Arabic Dialect Identification (ADI) in speech. We adapt the pre-trained Massively Multilingual Speech (MMS) model and fine-tune it on diverse Arabic TV broadcast speech to identify the following Arabic language varieties:
- Modern Standard Arabic (MSA)
- Egyptian Arabic (Masri and Sudani)
- Gulf Arabic (Khleeji, Iraqi, and Yemeni)
- Levantine Arabic (Shami)
- Maghrebi Arabic (Dialects of al-Maghreb al-Arabi in North Africa)
Model Use Cases
The model can be used as a component in a large-scale speech data collection pipeline to create resources for different Arabic dialects. It can also be used to select Modern Standard Arabic (MSA) speech data for text-to-speech (TTS) systems.
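As an illustration of the filtering use case, here is a minimal sketch of such a pipeline step. The function name `filter_msa`, the 0.9 threshold, and the directory layout are our own illustrative choices, not part of the model card:

```python
def filter_msa(paths, classifier, threshold=0.9):
    """Keep only clips whose top predicted dialect is MSA with high confidence.

    paths: iterable of audio file paths.
    classifier: a callable returning a list of {"label", "score"} dicts
    sorted by descending score, e.g. the audio-classification pipeline.
    """
    kept = []
    for path in paths:
        top = classifier(str(path))[0]  # highest-scoring prediction
        if top["label"] == "MSA" and top["score"] >= threshold:
            kept.append(path)
    return kept

# In practice, `classifier` would be the pipeline from the usage example:
# from transformers import pipeline
# classifier = pipeline("audio-classification",
#                       model="badrex/mms-300m-arabic-dialect-identifier")
```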
In the Hugging Face 🤗 Transformers library
Consider this speech segment as an example.
Now we can use the model to identify the speaker's dialect as follows:
```python
from transformers import pipeline

# Load the model
model_id = "badrex/mms-300m-arabic-dialect-identifier"
adi5_classifier = pipeline(
    "audio-classification",
    model=model_id,
    device="cpu",  # or device="cuda" if you are connected to a GPU
)

# Predict the dialect of an audio sample
# note: use a resolve/ URL (not blob/) so the raw audio file is downloaded
audio_path = "https://huggingface.co/badrex/mms-300m-arabic-dialect-identifier/resolve/main/examples/Da7ee7.mp3"
predictions = adi5_classifier(audio_path)

for pred in predictions:
    print(f"Dialect: {pred['label']:<10} Confidence: {pred['score']:.4f}")
```
For this example, you will get the following output:

```
Dialect: Egyptian   Confidence: 0.9926
Dialect: MSA        Confidence: 0.0040
Dialect: Levantine  Confidence: 0.0033
Dialect: Maghrebi   Confidence: 0.0001
Dialect: Gulf       Confidence: 0.0000
```
Here, the model predicts the dialect correctly 🥳
The model was trained to handle variation in recording environments and should perform reasonably well on noisy speech segments. Consider this noisy speech segment from an old theatre recording:
Using the model to make the prediction as above, we get the following output:

```
Dialect: MSA        Confidence: 0.9636
Dialect: Levantine  Confidence: 0.0319
Dialect: Egyptian   Confidence: 0.0023
Dialect: Gulf       Confidence: 0.0019
Dialect: Maghrebi   Confidence: 0.0003
```
Once again, the model makes the correct prediction.
⚠️ Caution: Make sure your audio is sampled at 16 kHz. If it is not, resample it first, for example with librosa or torchaudio.
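For instance, `librosa.load(path, sr=16000)` resamples on load. As a dependency-light alternative (a sketch of our own, not from the model card), SciPy's polyphase resampler does the same job on a raw waveform array:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a 1-D waveform to 16 kHz using polyphase filtering."""
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# One second of 44.1 kHz audio becomes 16000 samples
resampled = to_16k(np.zeros(44100, dtype=np.float32), 44100)
```

The pipeline also accepts raw arrays, so you can pass the resampled audio as `adi5_classifier({"raw": resampled, "sampling_rate": 16000})`.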
Info ℹ️
- Developed by: Badr M. Abdullah and Matthew Baas
- Model type: wav2vec 2.0 architecture
- Language: Arabic (and its varieties)
- License: Creative Commons Attribution 4.0 (CC BY 4.0)
- Finetuned from model: [MMS-300m](https://huggingface.co/facebook/mms-300m)
Training Data
Trained on the MGB-3 ADI-5 dataset, which consists of TV broadcast speech from Al Jazeera (news, interviews, discussions, TV shows, etc.).
Evaluation
The model has been evaluated on the challenging multi-domain MADIS-5 benchmark. It performed very well in our evaluation and is expected to be robust to real-world speech samples.
Out-of-Scope Use
The model should not be used to:
- Assess fluency or nativeness of speech
- Determine whether the speaker uses a formal or informal register
- Make judgments about a speaker's origin, education level, or socioeconomic status
- Filter or discriminate against speakers based on dialect
Bias, Risks, and Limitations ⚠️
Some Arabic varieties are not well-represented in the training data. The model may not work well for some dialects such as Yemeni Arabic, Iraqi Arabic, and Saharan Arabic.
Additional limitations include:
- Very short audio samples (< 2 seconds) may not provide enough information for accurate classification
- Code-switching between dialects (especially mixing with MSA) may result in less reliable classifications
- Speakers who have lived in multiple dialect regions may exhibit mixed features
- Speech from non-typical speakers such as children and people with speech disorders might be challenging for the model
Recommendations
- For optimal results, use audio segments of at least 5-10 seconds
- Confidence scores may not always be informative (e.g., the model can make a wrong prediction with high confidence)
- For critical applications, consider human verification of model predictions
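The last two recommendations can be combined into a simple gate that flags predictions for human review. The helper name `needs_review` and the thresholds below are illustrative choices of ours, not calibrated values from the paper:

```python
def needs_review(predictions, duration_sec, min_duration=5.0, threshold=0.8):
    """Flag a prediction for human verification.

    predictions: list of {"label", "score"} dicts as returned by the
    pipeline, sorted by descending score.
    duration_sec: length of the audio segment in seconds.
    """
    too_short = duration_sec < min_duration       # below recommended length
    low_confidence = predictions[0]["score"] < threshold
    return too_short or low_confidence

# A confident prediction on a long clip passes; a short clip is flagged
preds = [{"label": "Egyptian", "score": 0.99}, {"label": "MSA", "score": 0.01}]
print(needs_review(preds, duration_sec=8.0))   # False
print(needs_review(preds, duration_sec=1.5))   # True
```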
Citation
If you use this model in your research, please cite our paper:
BibTeX:

```bibtex
@inproceedings{abdullah2025voice,
  title={Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification},
  author={Badr M. Abdullah and Matthew Baas and Bernd M{\"o}bius and Dietrich Klakow},
  booktitle={Interspeech},
  year={2025},
  url={https://arxiv.org/pdf/2505.24713}
}
```
Model Card Contact 📧
If you have any questions, please do not hesitate to write an email to badr dot nlp at gmail dot com.