---
language: ar
tags:
  - darija
  - moroccan-arabic
  - sentiment-analysis
  - text-classification
  - fine-tuned
  - tweets
license: apache-2.0
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - cohen_kappa
---

# DarijaBERT Fine-Tuned for Sentiment Analysis 🇲🇦🧠

This sentiment analysis model is based on DarijaBERT, a language model pretrained on Moroccan Arabic (Darija) text.
The model has been fine-tuned to classify Moroccan Arabic tweets and public comments into three sentiment categories:

- Positive (2)
- Neutral (0)
- Negative (1)
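
The integers above are the raw class ids produced by the model. If the hosted configuration does not already expose these names, they can be attached when loading; the snippet below is only a convenience sketch, with the mapping taken from the list above.

```python
from transformers import AutoModelForSequenceClassification

# Attach human-readable names to the class ids at load time
# (only needed if the checkpoint's config lacks id2label / label2id).
id2label = {0: "Neutral", 1: "Negative", 2: "Positive"}
model = AutoModelForSequenceClassification.from_pretrained(
    "monsifnadir/DarijaBERT-For-Sentiment-Analysis",
    id2label=id2label,
    label2id={name: idx for idx, name in id2label.items()},
)
```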

## 🛠 Model Architecture

The base DarijaBERT architecture was extended with a classification head (sketched in code below):

- Two fully connected layers of 1,024 neurons each
- A dropout layer (p=0.3) to enhance generalization
- A final classification layer with 3 output neurons (one for each sentiment class)
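
The card does not spell out the exact wiring of this head, so the following is only one plausible PyTorch reading of it; the `SI2M-Lab/DarijaBERT` base checkpoint, the ReLU activations, and the [CLS]-token pooling are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel

class DarijaBertSentimentHead(nn.Module):
    """One plausible layout of the head described above: two 1024-unit
    dense layers, dropout (p=0.3), and a 3-way output layer."""

    def __init__(self, base_name="SI2M-Lab/DarijaBERT", hidden=1024, num_labels=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base_name)   # assumed base checkpoint
        size = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(size, hidden),
            nn.ReLU(),                                      # activation choice is an assumption
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                   # [CLS] pooling (assumed)
        return self.classifier(cls)
```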

## 🧠 Initial Fine-Tuning Details (Moroccan Tweets)

- Dataset: 17,441 Moroccan tweets
  - 9,894 positive tweets (56.73%)
  - 4,039 neutral tweets (23.16%)
  - 3,508 negative tweets (20.11%)
- Training Framework: Hugging Face Trainer API (a configuration sketch follows this list)
- Hyperparameters:
  - Learning rate: 1e-5
  - Batch size: 16 (with gradient accumulation = 32)
  - Weight decay: 0.01
  - EarlyStoppingCallback: training stopped automatically once accuracy reached 92%
  - Epochs: up to 20
- Evaluation Strategy: evaluated after every epoch, best model saved
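
As a reading of the settings above, a Trainer configuration along these lines would reproduce them. `model`, `train_ds`, `eval_ds`, and `compute_metrics` are placeholders, and the gradient-accumulation and early-stopping-patience values are assumptions (the card only states batch size 16 with gradient accumulation and that training stopped at 92% accuracy).

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Sketch of the Trainer setup implied by the hyperparameters above.
args = TrainingArguments(
    output_dir="darijabert-sentiment-tweets",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,       # assumption: one reading of "gradient accumulation = 32"
    weight_decay=0.01,
    num_train_epochs=20,
    evaluation_strategy="epoch",         # evaluate after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best checkpoint
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,                         # DarijaBERT + classification head
    args=args,
    train_dataset=train_ds,              # placeholder: tokenized tweet dataset
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,     # e.g. the metric function sketched further below
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```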

Performance:

  • Accuracy: 87%
  • F1 Score: 87%
  • Precision: 88%
  • Cohen's Kappa: 0.80
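
The metric code itself is not part of the card; a typical `compute_metrics` function producing these scores with scikit-learn might look like the following (the weighted averaging is an assumption).

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Return the metrics reported on this card from a Trainer EvalPrediction."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "cohen_kappa": cohen_kappa_score(labels, preds),
    }
```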

## 🔥 Fine-Tuning Details (Cash Transfer Public Policy 2023)

- Dataset: 1,344 Moroccan comments from YouTube and Hespress
  - 515 neutral
  - 505 negative
  - 324 positive
- Split: 80% training / 20% testing
- Hyperparameters:
  - Learning rate: 5e-6
  - Batch size: 32
  - Maximum sequence length: 256 tokens
  - Warmup ratio: 0.1
  - Early stopping enabled
  - Class weights adjusted for the label imbalance (one possible implementation is sketched below)
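
How the class weights were applied is not specified; a common pattern with the Trainer API is to override `compute_loss` with a weighted cross-entropy, as in the sketch below. The inverse-frequency weighting is an assumption.

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer variant that applies class weights to counter the label imbalance."""

    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Example: inverse-frequency weights from the class counts above
counts = torch.tensor([515.0, 505.0, 324.0])        # neutral=0, negative=1, positive=2
class_weights = counts.sum() / (len(counts) * counts)
```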

Performance:

  • Accuracy: 91.6%
  • Precision: 0.916
  • Recall: 0.916
  • F1 Score: 0.916
  • Cohenโ€™s Kappa: 0.872

## 📥 How to Use the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("monsifnadir/DarijaBERT-For-Sentiment-Analysis")
tokenizer = AutoTokenizer.from_pretrained("monsifnadir/DarijaBERT-For-Sentiment-Analysis")

text = "فرحت بزاف اليوم الحمد لله"  # "I was so happy today, thank God"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():            # inference only, no gradients needed
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()

# Map prediction to label
label_map = {0: "Neutral", 1: "Negative", 2: "Positive"}
print("Predicted Sentiment:", label_map[predicted_class])
```