---
language: ar
tags:
- darija
- moroccan-arabic
- sentiment-analysis
- text-classification
- fine-tuned
- tweets
license: apache-2.0
metrics:
- accuracy
- f1
- precision
- recall
- cohen_kappa
---

# DarijaBERT Fine-Tuned for Sentiment Analysis 🇲🇦🧠

This sentiment analysis model is based on **DarijaBERT**, a language model pretrained on Moroccan Arabic (Darija) text. It has been **fine-tuned** to classify Moroccan Arabic tweets and public comments into three sentiment categories:

- **Neutral** (0)
- **Negative** (1)
- **Positive** (2)

---

## 🛠 Model Architecture

The base DarijaBERT architecture was **extended** with:

- Two fully connected layers of **1024 neurons each**
- A **dropout layer (p=0.3)** to improve generalization
- A final classification layer with **3 output neurons** (one per sentiment class)

---

## 🧠 Initial Training Details (Moroccan Tweets)

- **Dataset**: 17,441 Moroccan tweets
  - 9,894 positive tweets (56.73%)
  - 4,039 neutral tweets (23.16%)
  - 3,508 negative tweets (20.11%)
- **Training framework**: Hugging Face Trainer API
- **Hyperparameters**:
  - Learning rate: `1e-5`
  - Batch size: `16` (with gradient accumulation = 32)
  - Weight decay: `0.01`
  - Epochs: up to 20
  - `EarlyStoppingCallback`: training stopped automatically at **92% accuracy**
- **Evaluation strategy**: evaluated after every epoch; the best model was saved

**Performance**:

- Accuracy: **87%**
- F1 score: **87%**
- Precision: **88%**
- Cohen's kappa: **0.80**

---

## 🔥 Fine-Tuning Details (Cash Transfer Public Policy, 2023)

- **Dataset**: 1,344 Moroccan comments from YouTube and Hespress
  - 515 neutral
  - 505 negative
  - 324 positive
- **Split**: 80% training / 20% testing
- **Hyperparameters**:
  - Learning rate: `5e-6`
  - Batch size: `32`
  - Maximum sequence length: `256` tokens
  - Warmup ratio: `0.1`
  - Early stopping enabled
  - Class weights adjusted for imbalance

**Performance**:

- Accuracy: **91.6%**
- Precision: **0.916**
- Recall: **0.916**
- F1 score: **0.916**
- Cohen's kappa: **0.872**

---
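The fine-tuning setup above mentions that class weights were adjusted for imbalance, but the exact weighting scheme is not stated. A common choice (an assumption here, not confirmed by the card) is inverse-frequency weighting, sketched below with the comment counts listed above:

```python
# Class counts from the fine-tuning dataset described above
counts = {"neutral": 515, "negative": 505, "positive": 324}

total = sum(counts.values())   # 1,344 comments in total
num_classes = len(counts)

# Inverse-frequency weighting: weight_c = total / (num_classes * count_c),
# so under-represented classes contribute more to the loss.
weights = {label: total / (num_classes * n) for label, n in counts.items()}

for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

Weights like these would typically be passed to a weighted `torch.nn.CrossEntropyLoss` inside a custom `Trainer.compute_loss`, so that the minority "positive" class is not drowned out during training.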
## 📥 How to Use the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("monsifnadir/DarijaBERT-For-Sentiment-Analysis")
tokenizer = AutoTokenizer.from_pretrained("monsifnadir/DarijaBERT-For-Sentiment-Analysis")

text = "فرحت بزاف اليوم الحمد لله"  # "I was very happy today, thank God"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()

# Map prediction to label
label_map = {0: "Neutral", 1: "Negative", 2: "Positive"}
print("Predicted Sentiment:", label_map[predicted_class])
```
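The model returns raw logits; to report a confidence score alongside the predicted label, apply a softmax over the three classes. The sketch below does this in plain Python with illustrative logit values (the numbers are stand-ins for `outputs.logits[0].tolist()` from the snippet above):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

label_map = {0: "Neutral", 1: "Negative", 2: "Positive"}

# Illustrative logits for one input, e.g. from outputs.logits[0].tolist()
logits = [-0.8, -1.1, 2.4]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)

print(f"{label_map[pred]} ({probs[pred]:.1%})")
```

In practice you would call `torch.softmax(outputs.logits, dim=-1)` directly; the pure-Python version above just makes the arithmetic explicit.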