Saudi Dialect Text Classifier

Model Description

This model is a fine-tuned text classifier that assigns Saudi dialect text to one of 43 categories.

Purpose: To categorize Saudi dialect text into predefined topics or intents.
Language: Arabic (Saudi Dialect)
Base Model: Omartificial-Intelligence-Space/SA-BERT-V1
Dataset: AI-Diploma/saudi-dialect-classification-train

Intended Uses

This model is intended for text classification tasks specifically on text written in the Saudi dialect of Arabic. Potential use cases include:

Topic classification of social media posts.
Categorization of customer feedback in Saudi Arabia.
Coarse intent detection in Saudi dialect conversations, limited to the 43 categories listed under Label Mapping (the model does not output sentiment polarity).

Training Procedure

The model was fine-tuned from the pretrained Omartificial-Intelligence-Space/SA-BERT-V1 model.

Dataset: AI-Diploma/saudi-dialect-classification-train
Training Data Size: 1065 examples
Epochs: 10
Learning Rate: 0.0001
Train Batch Size: 8
Optimizer: AdamW with weight decay (0.01)
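
As a rough sanity check on these settings, the sketch below computes the number of optimizer steps they imply and shows how they might map onto Hugging Face `TrainingArguments`. It assumes a single device, no gradient accumulation, and that the final partial batch is kept; none of this is stated above. The `output_dir` value is a hypothetical path, and this is not the original training script.

```python
import math

# Hyperparameters as listed above.
TRAIN_EXAMPLES = 1065
EPOCHS = 10
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 8
WEIGHT_DECAY = 0.01


def optimizer_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    per_epoch = math.ceil(n_examples / batch_size)
    return per_epoch, per_epoch * epochs


def build_training_args(output_dir="sa-bert-classifier"):  # hypothetical path
    """Map the listed settings onto transformers.TrainingArguments.

    The import is lazy so the arithmetic above runs without transformers.
    AdamW is the Trainer's default optimizer, so only weight decay is set.
    """
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        weight_decay=WEIGHT_DECAY,
    )


steps_per_epoch, total_steps = optimizer_steps(
    TRAIN_EXAMPLES, TRAIN_BATCH_SIZE, EPOCHS
)
print(steps_per_epoch, total_steps)
```

Under these assumptions the run comes to 134 steps per epoch and 1,340 steps in total.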

Evaluation Results

The model was evaluated using the following metrics:

Accuracy
Weighted F1-score

(Note: The training script did not evaluate on a separate test set, so no test scores are reported here.)
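
For reference, the two metrics named above can be computed as in the following minimal sketch. It uses only the standard library and follows the usual definition of weighted F1 (per-class F1 averaged with weights equal to each class's number of true examples). The toy labels are made up for illustration and are not results from this model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Toy label IDs, made up for illustration.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

With 43 categories, weighted F1 is dominated by the larger classes, so a per-class breakdown may be more informative for rare categories.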

How to Use

You can use this model with the transformers library for inference.

from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="AI-Diploma/AlWaleed2_Saudi_Classifier")

# Example usage
text = "هذا مثال لنص باللهجة السعودية"
result = classifier(text)
print(result)
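
The pipeline returns a list of dictionaries with `label` and `score` keys. The sketch below shows one way to pick the best prediction and discard low-confidence ones. The `summarize` helper and the 0.5 threshold are hypothetical, and the example output is hand-written rather than produced by the model.

```python
def summarize(results, min_score=0.5):
    """Return (label, score) for the best prediction, or None if below min_score.

    `results` is a list of {"label": ..., "score": ...} dicts, the shape the
    text-classification pipeline returns for a single input string.
    """
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    if best["score"] < min_score:
        return None
    return best["label"], round(best["score"], 3)


# Hand-written stand-in for classifier(text, top_k=3); scores are made up.
example_output = [
    {"label": "Greetings", "score": 0.71},
    {"label": "Daily Life", "score": 0.18},
    {"label": "Questions", "score": 0.05},
]
print(summarize(example_output))
```

Passing `top_k=3` when calling the classifier returns the three highest-scoring categories instead of only the top one, which can help when categories overlap.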

Limitations and Bias

Dataset Size: The training dataset contains 1065 examples across 43 categories, an average of about 25 examples per category. A larger and more diverse dataset would likely improve performance and cover more linguistic variation.
Dialect Coverage: The model is specifically trained on Saudi dialect and may not perform as well on other Arabic dialects.
Potential Bias: Like all language models, this model may inherit biases present in the training data.
Category Imbalance: The distribution of examples across the 43 categories should be examined for potential class imbalance, which could affect performance on minority classes.
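
The class balance check suggested above could be sketched as follows. The helper name, the `min_count` threshold, and the toy labels are assumptions for illustration; the real label column would come from the training dataset.

```python
from collections import Counter


def class_balance(labels, min_count=10):
    """Summarize how evenly examples are spread across classes.

    Classes with zero examples never appear in `labels`, so they are not
    reported; compare the result against the full list of 43 categories.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "counts": dict(counts),
        "imbalance_ratio": largest / smallest,
        "rare_classes": sorted(c for c, n in counts.items() if n < min_count),
    }


# Toy labels, made up for illustration.
toy_labels = ["Food and Dining"] * 12 + ["Weather"] * 3 + ["Sports"] * 6
report = class_balance(toy_labels, min_count=5)
print(report)
```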

Label Mapping

The model uses the following mapping from label ID to category name:

{0: 'Arts and Media', 1: 'Business and Money', 2: 'Cars and Driving', 3: 'Culture and Traditions', 4: 'Daily Life', 5: 'Days and Dates', 6: 'Descriptions', 7: 'Directions', 8: 'Economy and Finance', 9: 'Education and Training', 10: 'Emotions', 11: 'Entertainment', 12: 'Environment and Nature', 13: 'Events and Celebrations', 14: 'Family and Relationships', 15: 'Fitness and Exercise', 16: 'Food and Dining', 17: 'Greetings', 18: 'Health', 19: 'Hobbies and Interests', 20: 'Home and Living', 21: 'Instructions and Guidelines', 22: 'Law and Justice', 23: 'Mental Health', 24: 'Opinions', 25: 'Planning and Decisions', 26: 'Questions', 27: 'Real Estate and Housing', 28: 'Religion and Spirituality', 29: 'Responses', 30: 'Saudi Cities and Regions', 31: 'Shopping', 32: 'Social Media', 33: 'Sports', 34: 'Study and Education', 35: 'Technology', 36: 'Time', 37: 'Tourism', 38: 'Transport', 39: 'Travel', 40: 'Weather', 41: 'Wishes and Dreams', 42: 'Work'}
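
If the hosted configuration does not embed these names, transformers falls back to generic labels of the form `LABEL_<id>`. Whether that applies to this model is not stated here, so the sketch below handles both cases. The `resolve_label` helper is hypothetical, and only an excerpt of the mapping is repeated for brevity.

```python
# Excerpt of the mapping listed above; the full mapping covers IDs 0 to 42.
ID2LABEL_EXCERPT = {
    0: "Arts and Media",
    17: "Greetings",
    40: "Weather",
    42: "Work",
}


def resolve_label(label, id2label):
    """Turn 'LABEL_<id>' into a category name; pass other labels through."""
    prefix = "LABEL_"
    suffix = label[len(prefix):]
    if label.startswith(prefix) and suffix.isdigit():
        # Unknown IDs fall back to the raw label instead of raising.
        return id2label.get(int(suffix), label)
    return label


print(resolve_label("LABEL_17", ID2LABEL_EXCERPT))
print(resolve_label("Weather", ID2LABEL_EXCERPT))
```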

Model format: Safetensors
Model size: 0.2B params
Tensor type: F32