✍🏻 MARBERTv2 Arabic Written Dialect Classifier

Model Overview

This model is a fine-tuned version of UBC-NLP/MARBERTv2 for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and 4 regional Arabic dialects from raw text.

This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.


📌 Model Details

This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-class dialect classification task. It distinguishes among five classes of written Arabic: four regional dialect groups and Modern Standard Arabic.

  • MAGHREB (North African dialects)
  • LEV (Levantine dialects)
  • MSA (Modern Standard Arabic)
  • GLF (Gulf dialects)
  • EGY (Egyptian Arabic)

It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.


📊 Labels (id2label)

The model predicts one of the following five classes:

{
  "0": "MAGHREB", // Maghreb dialect (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
  "1": "LEV",     // Levantine dialect (Lebanon, Syria, Jordan, Palestine)
  "2": "MSA",     // Modern Standard Arabic
  "3": "GLF",     // Gulf dialect (Saudi Arabia, UAE, Kuwait, etc.)
  "4": "EGY"      // Egyptian dialect
}
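
As a minimal sketch, the mapping above can be mirrored in Python to convert between class ids and label names without loading the model. The names `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative and not part of the model's API; the integer keys assume the usual behaviour of `transformers` configs, which expose `id2label` with integer keys once loaded.

```python
# Illustrative copy of the label mapping listed above (not read from the model).
ID2LABEL = {
    0: "MAGHREB",
    1: "LEV",
    2: "MSA",
    3: "GLF",
    4: "EGY",
}

# Inverse mapping, useful when preparing labelled data.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(class_id: int) -> str:
    """Return the label name for a predicted class id."""
    return ID2LABEL[class_id]
```

In practice, `model.config.id2label` remains the authoritative source; a hard-coded copy like this is mainly handy for tests and data preparation.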

📚 Training Data

The model was trained on roughly 850,000 Arabic sentences drawn from 9 publicly available datasets, covering a wide variety of written Arabic dialects.

Distribution by Dialect:

| Dialect | Count |
| --- | --- |
| GLF | 253,553 |
| LEV | 243,025 |
| MAGHREB | 140,887 |
| EGY | 105,226 |
| MSA | 83,231 |
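
The distribution is noticeably imbalanced, with GLF about three times the size of MSA. The sketch below computes each class's share of the listed counts and one common way of deriving class weights (inverse frequency). The weighting scheme is an assumption for illustration only; this card does not say whether any re-weighting or re-sampling was used in training.

```python
# Per-dialect sentence counts, copied from the distribution above.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())  # 825,922 for the counts listed above

# Fraction of the listed data belonging to each class.
shares = {label: n / total for label, n in counts.items()}

# Hypothetical inverse-frequency weights: total / (num_classes * class_count).
# A perfectly balanced dataset would give every class a weight of 1.0.
num_classes = len(counts)
class_weights = {label: total / (num_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:8s} share={shares[label]:.1%} weight={class_weights[label]:.2f}")
```

Weights like these could be passed to a weighted cross-entropy loss, so that errors on the smaller classes (MSA, EGY) count for more than errors on the larger ones.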

⚙️ Training Details

  • Architecture: MARBERTv2 (BERT-based)
  • Task: Text Classification (Dialect Identification)
  • Objective: Multi-class classification with softmax over 5 dialect classes
  • Tokenizer: UBC-NLP/MARBERTv2
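
To make the objective concrete, here is a small, dependency-free sketch of a softmax over five class logits and the cross-entropy loss usually paired with it. The logits are made-up numbers, and cross-entropy is assumed here as the standard choice for this setup; the card itself does not name the loss function.

```python
import math

# Class order follows the id2label mapping given in this card.
LABELS = ["MAGHREB", "LEV", "MSA", "GLF", "EGY"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_id])


# Made-up logits for a single sentence (illustrative only).
example_logits = [0.2, -1.0, 0.5, 0.1, 3.0]
probs = softmax(example_logits)
predicted = LABELS[probs.index(max(probs))]
loss = cross_entropy(example_logits, target_id=4)  # 4 is EGY
```

The predicted class is simply the one with the highest probability, which is equivalent to taking the argmax of the raw logits.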

📂 Datasets Used

Below is a detailed overview of the datasets used in training and/or considered during development:

| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance |
| --- | --- | --- | --- | --- |
| MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab world and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab cities + MSA | 92.5% accuracy |
| MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab world and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab cities + MSA | 67.32% F1-score |
| DART | 25K tweets annotated via crowdsourcing and well balanced over five main groups of Arabic dialects | Manual | 5 Arab regions | Unknown |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added | Manual | 4 Arab regions + MSA | Unknown |
| ArSarcasm v2 | 15,548 tweets; an extension of the original ArSarcasm dataset (ArSarcasm v1 plus portions of the DAICT corpus and some new tweets) | Manual | 4 Arab regions + MSA | Unknown |
| IADD | Built by identifying, analyzing, and filtering five publicly available corpora (AOC, DART, PADIC, SHAMI, and TSAC) | Not specified | 5 regions and 9 countries | Unknown |
| QADI | 540k tweets (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab countries | 60.6% |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: AlGhad (JOR), Al-Riyadh (KSA), and Al-Youm Al-Sabe’ (EGY) | Manual | 3 Arab regions + MSA | Unknown |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 provinces and 21 countries | 6.39% - 26.78% |
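
Several of these datasets label text by city or country rather than by region, so combining them requires mapping their labels onto the five classes. The card does not describe how this was done; the sketch below is only one plausible approach. The mapping covers just the countries named in the label list of this card, and dropping unmapped rows is an assumption made for illustration.

```python
# Hypothetical country-to-region mapping, limited to the countries
# named in the label descriptions of this card.
COUNTRY_TO_REGION = {
    "Morocco": "MAGHREB",
    "Algeria": "MAGHREB",
    "Tunisia": "MAGHREB",
    "Lebanon": "LEV",
    "Syria": "LEV",
    "Jordan": "LEV",
    "Palestine": "LEV",
    "Saudi Arabia": "GLF",
    "UAE": "GLF",
    "Kuwait": "GLF",
    "Egypt": "EGY",
}


def harmonize(rows):
    """Map (text, country) pairs to (text, region) pairs.

    Rows whose country has no mapping are dropped rather than guessed.
    """
    harmonized = []
    for text, country in rows:
        region = COUNTRY_TO_REGION.get(country)
        if region is not None:
            harmonized.append((text, region))
    return harmonized


# Placeholder rows standing in for a country-labelled dataset.
sample = [("text a", "Egypt"), ("text b", "Kuwait"), ("text c", "Atlantis")]
print(harmonize(sample))
```

A real pipeline would need an explicit decision for every country each dataset covers, and MSA would have to come from datasets that label it directly.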

💡 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.inference_mode():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()

print(f"Predicted Dialect: {model.config.id2label[pred]}")
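
For more than one sentence at a time, the same steps can be wrapped in a small batching helper that also returns a confidence score. This is a sketch rather than an official API of the model: `batched`, `classify`, and the `batch_size` and `max_length` defaults are illustrative choices, and the heavy imports are deferred so the helper can be defined without loading the model.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify(texts,
             model_name="IbrahimAmin/marbertv2-arabic-written-dialect-classifier",
             batch_size=16,
             max_length=128):
    """Return a (label, probability) pair for each input text."""
    # Imported here so that defining the function does not require
    # torch or transformers to be loaded.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    results = []
    for batch in batched(texts, batch_size):
        # Pad to the longest text in the batch; truncate overly long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        with torch.inference_mode():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        confidences, class_ids = probs.max(dim=-1)
        for conf, idx in zip(confidences.tolist(), class_ids.tolist()):
            results.append((model.config.id2label[idx], conf))
    return results
```

Calling `classify(["...", "..."])` with a list of Arabic strings returns one `(label, probability)` pair per string. The probability is a useful signal for flagging uncertain predictions, for example on mixed MSA and dialect text.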

✨ Acknowledgements

  • MARBERTv2 team at UBC-NLP
  • Contributors of the Arabic dialect datasets used in training

📝 Citation

If you use this model in your research or application, please cite:

@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
  author = {Ibrahim Amin},
  title = {MARBERTv2 Arabic Written Dialect Classifier},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}