This is a fine-tuned version of the MARBERTv2 transformer for Arabic dialect classification, trained on the QADI dataset. The model classifies input text into one of 5 Arabic dialects: Gulf, Egyptian, Levantine, Mesopotamian (Iraq), and Maghrebi. For the 18-dialect version, see: https://huggingface.co/oahmedd/MARBERTv2-Finetuned-on-QADI-dataset
Dataset:
QADI (Qatar Arabic Dialect Identification)
440,000 tweets
https://huggingface.co/datasets/Abdelrahman-Rezk/Arabic_Dialect_Identification by Abdelrahman Rezk et al.
Covers 18 Arabic dialects
We added a new column to the dataset, 'arabic_region', which groups the 18 dialects into 5 Arabic regions: Gulf, Egyptian, Levantine, Mesopotamian (Iraq), and Maghrebi
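As a rough illustration of how a grouping column like 'arabic_region' could be derived, the sketch below maps country-level labels to the five regions. The country codes, the country-to-region assignment, the `dialect` field name, and the `add_region` helper are all assumptions made for illustration; the exact mapping used for this model is not stated here.

```python
# Hypothetical mapping from country-level dialect codes to the five regions.
# This assignment is an illustrative assumption, not the authors' confirmed grouping.
REGION_BY_COUNTRY = {
    "AE": "Gulf", "BH": "Gulf", "KW": "Gulf", "OM": "Gulf",
    "QA": "Gulf", "SA": "Gulf", "YE": "Gulf",
    "EG": "Egyptian", "SD": "Egyptian",
    "JO": "Levantine", "LB": "Levantine", "PL": "Levantine", "SY": "Levantine",
    "IQ": "Mesopotamian (Iraq)",
    "DZ": "Maghrebi", "LY": "Maghrebi", "MA": "Maghrebi", "TN": "Maghrebi",
}


def add_region(rows, label_key="dialect", region_key="arabic_region"):
    """Return copies of the rows with a region column derived from the dialect code."""
    return [{**row, region_key: REGION_BY_COUNTRY[row[label_key]]} for row in rows]


# Placeholder rows; the field names are hypothetical.
rows = [
    {"text": "tweet 1", "dialect": "EG"},
    {"text": "tweet 2", "dialect": "IQ"},
]
for row in add_region(rows):
    print(row["dialect"], "->", row["arabic_region"])
```

Keeping the mapping in a single dictionary makes it easy to swap in a different grouping if the intended assignment differs from the one assumed here.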
Evaluation:
Metrics: 85% accuracy and F1-score
This table shows the model’s F1 performance after grouping 18 Arabic dialects into 5 major regional classes:
Arabic Region | F1 Score (%) |
---|---|
Gulf | 89.0 |
Egyptian | 83.9 |
Levantine | 83.3 |
Iraqi | 70.6 |
Maghrebi | 81.7 |
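To put the per-region numbers in context, the short sketch below computes their unweighted (macro) mean and finds the weakest region. The scores are copied from the table; the variable names are illustrative. The unweighted mean comes to about 81.7, below the 85% headline figure, which may mean the headline uses a different averaging scheme such as weighting by class size. That is an assumption, since the averaging method is not stated here.

```python
# Per-region F1 scores copied from the table above.
f1_by_region = {
    "Gulf": 89.0,
    "Egyptian": 83.9,
    "Levantine": 83.3,
    "Iraqi": 70.6,
    "Maghrebi": 81.7,
}

# Unweighted (macro) mean: every region counts equally, regardless of its size.
macro_f1 = sum(f1_by_region.values()) / len(f1_by_region)

# Region with the lowest F1 score.
weakest = min(f1_by_region, key=f1_by_region.get)

print(f"Unweighted mean F1: {macro_f1:.1f}")
print(f"Lowest-scoring region: {weakest}")
```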
Usage:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
For dialect classification inference:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Label order as listed in this model card.
DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]

# Load the 5-dialect checkpoint so the classification head matches num_labels=5.
model = AutoModelForSequenceClassification.from_pretrained(
    "oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI",
    num_labels=5
)
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
model.eval()  # disable dropout for inference

text = "ازيك يصاحبي عامل ايه، ايه الاخبار"
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
)

# Run a forward pass without tracking gradients.
with torch.inference_mode():
    logits = model(**inputs).logits

# Pick the highest-scoring class and map it to its label.
prediction = torch.argmax(logits, dim=1).item()
predicted_dialect = DIALECT_LABELS[prediction]
print(f"Predicted Dialect: {predicted_dialect}")
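The snippet above reports only the top class. If per-dialect confidence is also useful, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python so it runs without downloading the model; `rank_dialects` is a hypothetical helper and the example logits are made-up values, not real model output.

```python
import math

DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]


def rank_dialects(logits, labels=DIALECT_LABELS):
    """Convert raw logits to softmax probabilities, sorted from most to least likely."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration only.
example_logits = [0.2, 3.1, 0.5, -1.0, -0.4]
for label, prob in rank_dialects(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper could be called as `rank_dialects(logits[0].tolist())`, assuming `logits` is the tensor produced in the inference snippet.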
Citation:
If you find this helpful, please cite our work.
@article{essameldin2025arabic,
title={Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis},
author={Essameldin, Omar A and Elbeih, Ali O and Gomaa, Wael H and Elsersy, Wael F},
journal={arXiv preprint arXiv:2506.19753},
year={2025}
}
Base model: UBC-NLP/MARBERTv2