---
language: ar
tags:
  - darija
  - moroccan-arabic
  - sentiment-analysis
  - text-classification
  - fine-tuned
  - tweets
license: apache-2.0
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - cohen_kappa
---

# DarijaBERT Fine-Tuned for Sentiment Analysis 🇲🇦🧠

This sentiment analysis model is based on DarijaBERT, a language model pretrained on Moroccan Arabic (Darija) text.
The model has been fine-tuned to classify Moroccan Arabic tweets and public comments into three sentiment categories:

- Positive (2)
- Neutral (0)
- Negative (1)
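
The integers above are the raw class ids produced by the model. If the hosted configuration does not already expose these names, they can be attached when loading; the snippet below is only a convenience sketch, with the mapping taken from the list above.

```python
from transformers import AutoModelForSequenceClassification

# Attach human-readable names to the class ids at load time
# (only needed if the checkpoint's config lacks id2label / label2id).
id2label = {0: "Neutral", 1: "Negative", 2: "Positive"}
model = AutoModelForSequenceClassification.from_pretrained(
    "monsifnadir/DarijaBERT-For-Sentiment-Analysis",
    id2label=id2label,
    label2id={name: idx for idx, name in id2label.items()},
)
```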

## 🛠 Model Architecture

The base DarijaBERT architecture was extended with a classification head (sketched in code below):

- Two fully connected layers of 1,024 neurons each
- A dropout layer (p=0.3) to enhance generalization
- A final classification layer with 3 output neurons (one for each sentiment class)
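
The card does not spell out the exact wiring of this head, so the following is only one plausible PyTorch reading of it; the `SI2M-Lab/DarijaBERT` base checkpoint, the ReLU activations, and the [CLS]-token pooling are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel

class DarijaBertSentimentHead(nn.Module):
    """One plausible layout of the head described above: two 1024-unit
    dense layers, dropout (p=0.3), and a 3-way output layer."""

    def __init__(self, base_name="SI2M-Lab/DarijaBERT", hidden=1024, num_labels=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base_name)   # assumed base checkpoint
        size = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(size, hidden),
            nn.ReLU(),                                      # activation choice is an assumption
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                   # [CLS] pooling (assumed)
        return self.classifier(cls)
```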

## 🧠 Initial Fine-Tuning Details (Moroccan Tweets)

- Dataset: 17,441 Moroccan tweets
  - 9,894 positive tweets (56.73%)
  - 4,039 neutral tweets (23.16%)
  - 3,508 negative tweets (20.11%)
- Training Framework: Hugging Face Trainer API (a configuration sketch follows this list)
- Hyperparameters:
  - Learning rate: 1e-5
  - Batch size: 16 (with gradient accumulation = 32)
  - Weight decay: 0.01
  - EarlyStoppingCallback: training stopped automatically once accuracy reached 92%
  - Epochs: up to 20
- Evaluation Strategy: evaluated after every epoch, best model saved
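
As a reading of the settings above, a Trainer configuration along these lines would reproduce them. `model`, `train_ds`, `eval_ds`, and `compute_metrics` are placeholders, and the gradient-accumulation and early-stopping-patience values are assumptions (the card only states batch size 16 with gradient accumulation and that training stopped at 92% accuracy).

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Sketch of the Trainer setup implied by the hyperparameters above.
args = TrainingArguments(
    output_dir="darijabert-sentiment-tweets",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,       # assumption: one reading of "gradient accumulation = 32"
    weight_decay=0.01,
    num_train_epochs=20,
    evaluation_strategy="epoch",         # evaluate after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best checkpoint
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,                         # DarijaBERT + classification head
    args=args,
    train_dataset=train_ds,              # placeholder: tokenized tweet dataset
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,     # e.g. the metric function sketched further below
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```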

Performance:

  • Accuracy: 87%
  • F1 Score: 87%
  • Precision: 88%
  • Cohen's Kappa: 0.80
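
The metric code itself is not part of the card; a typical `compute_metrics` function producing these scores with scikit-learn might look like the following (the weighted averaging is an assumption).

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Return the metrics reported on this card from a Trainer EvalPrediction."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "cohen_kappa": cohen_kappa_score(labels, preds),
    }
```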

## 🔥 Fine-Tuning Details (Cash Transfer Public Policy 2023)

- Dataset: 1,344 Moroccan comments from YouTube and Hespress
  - 515 neutral
  - 505 negative
  - 324 positive
- Split: 80% training / 20% testing
- Hyperparameters:
  - Learning rate: 5e-6
  - Batch size: 32
  - Maximum sequence length: 256 tokens
  - Warmup ratio: 0.1
  - Early stopping enabled
  - Class weights adjusted for the label imbalance (one possible implementation is sketched below)
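
How the class weights were applied is not specified; a common pattern with the Trainer API is to override `compute_loss` with a weighted cross-entropy, as in the sketch below. The inverse-frequency weighting is an assumption.

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer variant that applies class weights to counter the label imbalance."""

    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Example: inverse-frequency weights from the class counts above
counts = torch.tensor([515.0, 505.0, 324.0])        # neutral=0, negative=1, positive=2
class_weights = counts.sum() / (len(counts) * counts)
```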

Performance:

  • Accuracy: 91.6%
  • Precision: 0.916
  • Recall: 0.916
  • F1 Score: 0.916
  • Cohenโ€™s Kappa: 0.872

## 📥 How to Use the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("monsifnadir/DarijaBERT-For-Sentiment-Analysis")
tokenizer = AutoTokenizer.from_pretrained("monsifnadir/DarijaBERT-For-Sentiment-Analysis")

text = "فرحت بزاف اليوم الحمد لله"  # "I was so happy today, thank God"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():            # inference only, no gradients needed
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()

# Map prediction to label
label_map = {0: "Neutral", 1: "Negative", 2: "Positive"}
print("Predicted Sentiment:", label_map[predicted_class])
```