Saudi Dialect Text Classifier
Model Description
This model is fine-tuned to classify Saudi dialect text into 43 categories.
- Purpose: To categorize Saudi dialect text into predefined topics or intents.
- Language: Arabic (Saudi Dialect)
- Base Model: Omartificial-Intelligence-Space/SA-BERT-V1
- Dataset: AI-Diploma/saudi-dialect-classification-train
Intended uses
This model is intended for text classification tasks specifically on text written in the Saudi dialect of Arabic. Potential use cases include:
- Topic classification of social media posts.
- Categorization of customer feedback in Saudi Arabia.
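- Analyzing intent in Saudi dialect conversations (for example, telling apart greetings, questions, and opinions). The label set has no sentiment-polarity classes, so sentiment analysis would need a separate model.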
Training Procedure
The model was fine-tuned from the pretrained Omartificial-Intelligence-Space/SA-BERT-V1 model.
- Dataset: AI-Diploma/saudi-dialect-classification-train
- Training Data Size: 1065 examples
- Epochs: 10
- Learning Rate: 0.0001
- Train Batch Size: 8
- Optimizer: AdamW with weight decay (0.01)
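As a rough illustration of what these settings imply, the sketch below collects the listed hyperparameters in a small config object and derives the number of optimizer steps. The `FineTuneConfig` class and helper names are hypothetical, and the step counts assume a single device, no gradient accumulation, and that the final partial batch is kept; none of these are stated in the training details above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed in the training procedure (hypothetical container)."""
    train_examples: int = 1065
    epochs: int = 10
    learning_rate: float = 1e-4
    train_batch_size: int = 8
    weight_decay: float = 0.01  # AdamW weight decay


def steps_per_epoch(cfg: FineTuneConfig) -> int:
    # Assumption: the last, smaller batch is kept rather than dropped.
    return math.ceil(cfg.train_examples / cfg.train_batch_size)


def total_steps(cfg: FineTuneConfig) -> int:
    # Assumption: one optimizer step per batch (no gradient accumulation).
    return steps_per_epoch(cfg) * cfg.epochs


cfg = FineTuneConfig()
print(steps_per_epoch(cfg), total_steps(cfg))
```

Under those assumptions the run amounts to 134 steps per epoch and 1340 steps in total.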
Evaluation Results
The model was evaluated using the following metrics:
- Accuracy
- Weighted F1-score
(Note: No results on a held-out test set are reported here; the training script did not include them.)
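For readers who want to reproduce the two metrics on their own test split, here is a minimal dependency-free sketch. It is not the evaluation code used for this model. The function names are illustrative, and weighted F1 is taken to mean the per-class F1 averaged with weights proportional to each class's number of true examples.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Toy labels for illustration only; these are not results for this model.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

With 43 classes and few examples per class, weighted F1 can hide poor performance on rare classes, so a per-class breakdown may be worth reporting alongside it.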
How to use
You can use this model with the transformers library for inference.
from transformers import pipeline
# Load the pipeline
classifier = pipeline("text-classification", model="AI-Diploma/AlWaleed2_Saudi_Classifier")
# Example usage
text = "هذا مثال لنص باللهجة السعودية"
result = classifier(text)
print(result)
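Because the classifier was trained on a small dataset, it may be useful to treat low-confidence predictions as "unknown" rather than accept every label. The helper below is a sketch of that idea: `top_prediction` is a hypothetical name, and the 0.5 threshold is an arbitrary placeholder, not a tuned value. It relies only on the pipeline returning a list of dicts with `label` and `score` keys for a single input string.

```python
def top_prediction(classifier, text, min_score=0.5):
    """Return (label, score); label is None when the score is below min_score.

    `classifier` is any callable that behaves like the text-classification
    pipeline, i.e. returns [{"label": ..., "score": ...}] for one string.
    """
    result = classifier(text)[0]
    label = result["label"] if result["score"] >= min_score else None
    return label, result["score"]
```

With the `classifier` and `text` defined above, `label, score = top_prediction(classifier, text)` yields `None` as the label whenever the model's confidence falls under the threshold.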
Limitations and Bias
- Dataset Size: The training dataset contains 1065 examples across 43 categories, an average of roughly 25 examples per category. A larger and more diverse dataset would likely improve performance and cover more linguistic variation.
- Dialect Coverage: The model is specifically trained on Saudi dialect and may not perform as well on other Arabic dialects.
- Potential Bias: Like all language models, this model may inherit biases present in the training data.
- Category Imbalance: The distribution of examples across the 43 categories should be examined for potential class imbalance, which could affect performance on minority classes.
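The class-imbalance check mentioned above can be sketched as follows. The function name and the returned fields are illustrative, and the input is assumed to be a plain list of integer label IDs taken from the training set; the actual column name and format in the dataset are not specified here.

```python
from collections import Counter


def summarize_class_balance(labels, num_classes=43):
    """Summarize how evenly examples are spread across label IDs 0..num_classes-1."""
    counts = Counter(labels)
    per_class = [counts.get(i, 0) for i in range(num_classes)]
    present = [c for c in per_class if c > 0]
    return {
        "mean": len(labels) / num_classes,
        "min": min(per_class),
        "max": max(per_class),
        # Label IDs with no training examples at all.
        "missing": [i for i, c in enumerate(per_class) if c == 0],
        # Ratio between the largest and smallest non-empty class.
        "max_to_min_ratio": max(present) / min(present) if present else None,
    }


# Toy example with 4 classes; real use would pass the dataset's label column.
print(summarize_class_balance([0, 0, 0, 1, 2, 2], num_classes=4))
```

A large `max_to_min_ratio` or a non-empty `missing` list would suggest checking per-class metrics rather than relying on overall accuracy.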
Label Mapping
The model uses the following mapping from label ID to category name:
{0: 'Arts and Media', 1: 'Business and Money', 2: 'Cars and Driving', 3: 'Culture and Traditions', 4: 'Daily Life', 5: 'Days and Dates', 6: 'Descriptions', 7: 'Directions', 8: 'Economy and Finance', 9: 'Education and Training', 10: 'Emotions', 11: 'Entertainment', 12: 'Environment and Nature', 13: 'Events and Celebrations', 14: 'Family and Relationships', 15: 'Fitness and Exercise', 16: 'Food and Dining', 17: 'Greetings', 18: 'Health', 19: 'Hobbies and Interests', 20: 'Home and Living', 21: 'Instructions and Guidelines', 22: 'Law and Justice', 23: 'Mental Health', 24: 'Opinions', 25: 'Planning and Decisions', 26: 'Questions', 27: 'Real Estate and Housing', 28: 'Religion and Spirituality', 29: 'Responses', 30: 'Saudi Cities and Regions', 31: 'Shopping', 32: 'Social Media', 33: 'Sports', 34: 'Study and Education', 35: 'Technology', 36: 'Time', 37: 'Tourism', 38: 'Transport', 39: 'Travel', 40: 'Weather', 41: 'Wishes and Dreams', 42: 'Work'}
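If the pipeline returns generic labels such as `LABEL_12` instead of category names, the mapping above can be applied by hand. Whether this model's configuration already embeds the category names is not stated here, so the sketch below handles both cases; `readable_label` is a hypothetical helper, and the `LABEL_<id>` format is an assumption about the raw output.

```python
ID2LABEL = {
    0: 'Arts and Media', 1: 'Business and Money', 2: 'Cars and Driving',
    3: 'Culture and Traditions', 4: 'Daily Life', 5: 'Days and Dates',
    6: 'Descriptions', 7: 'Directions', 8: 'Economy and Finance',
    9: 'Education and Training', 10: 'Emotions', 11: 'Entertainment',
    12: 'Environment and Nature', 13: 'Events and Celebrations',
    14: 'Family and Relationships', 15: 'Fitness and Exercise',
    16: 'Food and Dining', 17: 'Greetings', 18: 'Health',
    19: 'Hobbies and Interests', 20: 'Home and Living',
    21: 'Instructions and Guidelines', 22: 'Law and Justice',
    23: 'Mental Health', 24: 'Opinions', 25: 'Planning and Decisions',
    26: 'Questions', 27: 'Real Estate and Housing',
    28: 'Religion and Spirituality', 29: 'Responses',
    30: 'Saudi Cities and Regions', 31: 'Shopping', 32: 'Social Media',
    33: 'Sports', 34: 'Study and Education', 35: 'Technology', 36: 'Time',
    37: 'Tourism', 38: 'Transport', 39: 'Travel', 40: 'Weather',
    41: 'Wishes and Dreams', 42: 'Work',
}


def readable_label(raw_label: str) -> str:
    """Map 'LABEL_<id>' to its category name; return other labels unchanged."""
    prefix = "LABEL_"
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        return ID2LABEL.get(int(suffix), raw_label)
    return raw_label


print(readable_label("LABEL_40"))
```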
