
This is a fine-tuned version of the MARBERTv2 transformer for Arabic dialect classification, trained on the QADI dataset. The model classifies input text into one of 5 Arabic dialect groups: Gulf, Egyptian, Levantine, Mesopotamian (Iraq), and Maghrebi. For the 18-dialect version, see: https://huggingface.co/oahmedd/MARBERTv2-Finetuned-on-QADI-dataset

Dataset:

QADI (Qatar Arabic Dialect Identification)

440,000 tweets

https://huggingface.co/datasets/Abdelrahman-Rezk/Arabic_Dialect_Identification by Abdelrahman Rezk et al.

Covers 18 Arabic dialects

We added a new column, 'arabic_region', to the dataset. It groups the 18 dialects into 5 Arabic regions: Gulf, Egyptian, Levantine, Mesopotamian (Iraq), and Maghrebi.
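
The grouping step can be sketched as a simple lookup from dialect code to region. This is a minimal illustration, not the mapping actually used for training: the two-letter country codes, the assignment of each code to a region, the `dialect` field name, and the helper `add_arabic_region` are all assumptions made for the example.

```python
# Hypothetical grouping of 18 country-level dialect codes into 5 regions.
# The exact assignment used to build 'arabic_region' is not stated in this
# card; treat this mapping as an illustrative assumption.
REGION_BY_COUNTRY = {
    "AE": "Gulf", "BH": "Gulf", "KW": "Gulf", "OM": "Gulf",
    "QA": "Gulf", "SA": "Gulf", "YE": "Gulf",
    "EG": "Egyptian", "SD": "Egyptian",
    "JO": "Levantine", "LB": "Levantine", "PL": "Levantine", "SY": "Levantine",
    "IQ": "Iraqi",
    "DZ": "Maghrebi", "LY": "Maghrebi", "MA": "Maghrebi", "TN": "Maghrebi",
}


def add_arabic_region(rows, dialect_key="dialect"):
    """Return copies of the rows with an added 'arabic_region' field.

    `dialect_key` is a hypothetical column name; adjust it to the dataset.
    """
    return [
        {**row, "arabic_region": REGION_BY_COUNTRY[row[dialect_key]]}
        for row in rows
    ]


sample = [{"text": "...", "dialect": "EG"}, {"text": "...", "dialect": "TN"}]
for row in add_arabic_region(sample):
    print(row["dialect"], "->", row["arabic_region"])
```

A plain dictionary keeps the grouping easy to audit and to change if a dialect should belong to a different region.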

Evaluation:

Metrics: 85% accuracy and F1-score

This table shows the model’s F1 performance after grouping 18 Arabic dialects into 5 major regional classes:

| Arabic Region | F1 Score (%) |
|---|---|
| Gulf | 89.0 |
| Egyptian | 83.9 |
| Levantine | 83.3 |
| Iraqi | 70.6 |
| Maghrebi | 81.7 |
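
As a quick sanity check on the per-region numbers, the snippet below computes their unweighted mean and finds the weakest class. It only does arithmetic on the values in the table. The card does not say how the headline 85% figure is averaged, so the unweighted mean here is not expected to match it.

```python
# Per-region F1 scores (%) copied from the table above.
f1_by_region = {
    "Gulf": 89.0,
    "Egyptian": 83.9,
    "Levantine": 83.3,
    "Iraqi": 70.6,
    "Maghrebi": 81.7,
}

# Unweighted (macro) mean: every region counts equally, regardless of size.
macro_f1 = sum(f1_by_region.values()) / len(f1_by_region)

# Region with the lowest F1 score.
weakest = min(f1_by_region, key=f1_by_region.get)

print(f"Unweighted mean F1: {macro_f1:.1f}")
print(f"Weakest region: {weakest}")
```

The unweighted mean comes to roughly 81.7, and Iraqi is the lowest-scoring region.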

Usage:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")

For dialect classification inference:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]

model = AutoModelForSequenceClassification.from_pretrained(
  "oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI",
  num_labels=5
)
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")

model.eval()

text = "ازيك يصاحبي عامل ايه، ايه الاخبار"

inputs = tokenizer(
  text,
  return_tensors="pt",
  truncation=True,
  padding=True,
)

with torch.inference_mode():
  logits = model(**inputs).logits
  prediction = torch.argmax(logits, dim=1).item()
predicted_dialect = DIALECT_LABELS[prediction]

print(f"Predicted Dialect: {predicted_dialect}")
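
To see how confident the model is, rather than only the top label, the raw logits can be turned into probabilities with a softmax. The sketch below does this in plain Python on a list of numbers, so it runs without the model. The helper `rank_dialects` and the example logits are hypothetical, and the label order is assumed to be the same as `DIALECT_LABELS` in the inference snippet.

```python
import math

# Assumed to match the label order used in the inference snippet.
DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]


def rank_dialects(logits, labels=DIALECT_LABELS):
    """Turn one row of raw logits into (label, probability) pairs, highest first."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtract the max before exponentiating for stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration only.
for label, prob in rank_dialects([0.2, 3.1, 0.5, -1.0, 0.1]):
    print(f"{label}: {prob:.3f}")
```

With the real model, pass `logits[0].tolist()` from the inference code in place of the made-up list.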

Citation:

If you find this helpful, please cite our work.

@article{essameldin2025arabic,
  title={Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis},
  author={Essameldin, Omar A and Elbeih, Ali O and Gomaa, Wael H and Elsersy, Wael F},
  journal={arXiv preprint arXiv:2506.19753},
  year={2025}
}

Model details:

Base model: UBC-NLP/MARBERTv2

Model size: 163M params (Safetensors, tensor type F32)