🔍 Arabic AI Text Detection Model

📖 Model Description

Arabic AI vs Human Text Detection Model - Fine-tuned AraBERT

This model detects AI-generated text in Arabic. It is fine-tuned from aubmindlab/bert-base-arabertv2 and distinguishes between:

🧑 Human-written Arabic text (label: 0, "HUMAN")
🤖 AI-generated Arabic text (label: 1, "AI")

The model was trained in small steps with frequent validation-loss checks and early stopping to limit overfitting.

🎯 Intended Use

Primary Use Cases

Content Verification: Verify authenticity of Arabic articles and posts
Academic Integrity: Detect AI-generated essays and assignments
Social Media Monitoring: Identify automated Arabic content
Research: Benchmark for Arabic AI detection studies
Content Moderation: Flag potentially AI-generated Arabic text

Supported Text Types

📰 News Articles (Modern Standard Arabic)
📝 Essays and Academic Writing
💬 Social Media Posts
📚 Blog Posts and Articles
🗞️ Formal and Semi-formal Arabic Text

📊 Performance Metrics

Metric	Score	Description
🎯 Accuracy	95.0%	Overall classification accuracy
⚖️ Precision	95.0%	Precision across both classes
🎪 Recall	94.0%	Recall across both classes
🏆 F1 Score	94.0%	Harmonic mean of precision and recall

Evaluated on a balanced validation set with equal human and AI-generated Arabic texts.
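For readers who want to see how the four metrics relate, here is a minimal sketch that derives them from confusion-matrix counts. The counts are hypothetical and are not this model's validation results. The card does not state how precision and recall were averaged across classes, so the sketch treats "AI" as the positive class, which is an assumption.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the positive ("AI") class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (NOT the model's real results)
example = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(example)
```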

🚀 Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Method 1: Using pipeline (Recommended)
classifier = pipeline(
    "text-classification",
    model="sabaridsnfuji/arabic-ai-text-detector",
    tokenizer="sabaridsnfuji/arabic-ai-text-detector"
)

# Test with Arabic text
arabic_text = "هذا مثال على نص باللغة العربية"
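# Truncate to the model's 512-token limit so that long inputs do not raise an error
result = classifier(arabic_text, truncation=True, max_length=512)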

print(f"Prediction: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.2%}")

Advanced Usage

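# Method 2: Manual prediction with probabilities
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "sabaridsnfuji/arabic-ai-text-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)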

def predict_arabic_text(text):
    # Tokenize
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        truncation=True, 
        max_length=512,
        padding=True
    )
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    # Get results
    predicted_class = torch.argmax(probabilities, dim=1).item()
    confidence = torch.max(probabilities, dim=1)[0].item()
    
    labels = {0: "HUMAN", 1: "AI"}
    
    return {
        "prediction": labels[predicted_class],
        "confidence": confidence,
        "probabilities": {
            "human": probabilities[0][0].item(),
            "ai": probabilities[0][1].item()
        }
    }

# Example usage
text = "النص العربي المراد تصنيفه هنا"
result = predict_arabic_text(text)
print(result)

Batch Processing

# Process multiple texts efficiently
texts = [
    "النص الأول باللغة العربية",
    "النص الثاني للتصنيف", 
    "المزيد من النصوص العربية"
]

# Reuses the `classifier` pipeline created in Basic Usage
results = classifier(texts, truncation=True, max_length=512)
for text, result in zip(texts, results):
    print(f"Text: {text[:50]}...")
    print(f"Prediction: {result['label']} ({result['score']:.2%})")
    print("-" * 50)

🏗️ Model Architecture

AraBERT-v2 Base Architecture
├── 📥 Input: Arabic text (max 512 tokens)
├── 🔤 Tokenizer: AraBERT Arabic tokenizer  
├── 🧠 Encoder: 12-layer Transformer (110M parameters)
├── 🎯 Classifier: Linear layer (768 → 2 classes)
└── 📤 Output: [Human, AI] classification + probabilities
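The last step of the diagram, turning the two classifier outputs into probabilities, can be sketched in plain Python. This is an illustrative softmax over hypothetical logits, not code taken from the model. The [HUMAN, AI] ordering follows the label mapping given in the Model Description.

```python
import math


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


ID2LABEL = {0: "HUMAN", 1: "AI"}

# Hypothetical logits in the order [HUMAN, AI]
probs = softmax([-1.2, 2.3])
predicted = ID2LABEL[probs.index(max(probs))]
print(predicted, probs)
```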

🎓 Training Details

Dataset

Size: Custom Arabic AI/Human dataset (4,798 samples)
Language: Arabic (Modern Standard Arabic + dialectal variations)
Balance: 50% human-written, 50% AI-generated
Sources: News articles, essays, social media, academic texts
Split: 80% training, 20% validation
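The card does not describe how the 80/20 split was drawn. One plausible approach is a per-label (stratified) split, which keeps the 50/50 balance in both parts. The sketch below is written under that assumption; `stratified_split`, its seed and the placeholder texts are hypothetical.

```python
import random


def stratified_split(samples, train_fraction=0.8, seed=42):
    """Split (text, label) pairs into train/validation sets, label by label."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))

    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])
        val.extend(group[cut:])
    rng.shuffle(train)  # mix the labels again for training
    return train, val


# Placeholder data matching the stated size and balance: 2,399 texts per class
samples = ([(f"human text {i}", 0) for i in range(2399)]
           + [(f"ai text {i}", 1) for i in range(2399)])
train, val = stratified_split(samples)
print(len(train), len(val))  # prints: 3838 960
```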

Training Configuration

Base Model: aubmindlab/bert-base-arabertv2 (AraBERT-v2)
Strategy: Step-by-step training with validation loss tracking
Epochs: 3 with early stopping
Batch Size: 8
Learning Rate: 2e-05
Max Sequence Length: 512 tokens
Optimizer: AdamW with weight decay (0.01)
Hardware: GPU training with mixed precision (FP16)

Training Process

Step-by-step training: Model trained in small chunks (0.2 epochs each)
Frequent validation: Evaluation after each training chunk
Best model selection: Saved only when validation loss improved
Early stopping: Prevented overfitting with patience mechanism
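The four steps above can be sketched as a small loop over per-chunk validation losses. This is not the original training script: the patience value and the loss numbers are hypothetical, and the actual training and checkpoint saving are replaced by comments.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Apply best-model selection and early stopping to per-chunk losses.

    Returns (best_chunk_index, best_loss, chunks_run).
    """
    best_loss = float("inf")
    best_chunk = -1
    bad_chunks = 0
    chunks_run = 0
    for chunk, loss in enumerate(val_losses):
        chunks_run += 1  # one training chunk (0.2 epochs) plus one evaluation
        if loss < best_loss:
            # Validation loss improved: this is where a checkpoint would be saved
            best_loss, best_chunk = loss, chunk
            bad_chunks = 0
        else:
            bad_chunks += 1
            if bad_chunks >= patience:
                break  # stop early, keep the best checkpoint
    return best_chunk, best_loss, chunks_run


# Hypothetical losses: 3 epochs / 0.2 epochs per chunk = 15 chunks
losses = [0.62, 0.48, 0.35, 0.29, 0.24, 0.21, 0.22, 0.23,
          0.25, 0.26, 0.27, 0.28, 0.30, 0.31, 0.33]
best_chunk, best_loss, chunks_run = train_with_early_stopping(losses)
print(best_chunk, best_loss, chunks_run)
```

With these made-up numbers the loop stops after 9 of the 15 chunks and keeps the checkpoint from chunk index 5.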

📈 Evaluation & Benchmarks

Validation Performance

Validation Accuracy: 95.0%
Cross-domain Testing: Tested on various Arabic text sources
Robustness: Evaluated on different writing styles and topics

Comparison with Baselines

Model	Accuracy	Notes
This Model	95.0%	Step-by-step trained AraBERT, on its own validation set
GPTZero	62.7%	On AIRABIC benchmark
Random Baseline	50.0%	Random classification

⚠️ Limitations & Considerations

Known Limitations

Text Length: Inputs are limited to 512 tokens; longer texts are truncated, so only the first 512 tokens are classified
Domain: Best performance on formal/semi-formal Arabic
Dialects: Primarily trained on Modern Standard Arabic
Temporal: Training data has a time cutoff, so text from newer AI models may be detected less reliably
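One common workaround for the 512-token limit is to split a long text into overlapping windows, classify each window, and average the probabilities. The card does not describe or evaluate this, so treat the sketch below as an untested idea. `sliding_windows`, `average_probabilities` and the window sizes are hypothetical, with 510 chosen to leave room for the two special tokens BERT adds.

```python
def sliding_windows(token_ids, window=510, stride=255):
    """Split a list of token ids into overlapping windows of at most `window` ids."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the text
        start += stride
    return windows


def average_probabilities(per_window_probs):
    """Average [p_human, p_ai] pairs across all windows."""
    n = len(per_window_probs)
    return [sum(p[i] for p in per_window_probs) / n for i in range(2)]


# Placeholder ids standing in for a 1,200-token document
chunks = sliding_windows(list(range(1200)))
print(len(chunks), [len(c) for c in chunks])
```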

Potential Biases

Source Bias: Training data may reflect specific domains
Dialectal Bias: May perform differently on regional Arabic varieties
AI Model Bias: Trained primarily on specific AI models' outputs

Recommendations

Best for: News articles, essays, formal Arabic text
Consider carefully for: Informal chat, poetry, technical jargon
Combine with: Human review for critical applications
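The human-review recommendation can be sketched as a confidence gate. The `triage` helper and its 0.90 threshold are hypothetical and not calibrated for this model. The input is assumed to have the shape returned by `predict_arabic_text` in Advanced Usage.

```python
def triage(result, threshold=0.90):
    """Accept confident predictions and route the rest to a human reviewer.

    `result` is expected to contain "prediction" and "confidence" keys.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.5 and 1.0")
    if result["confidence"] >= threshold:
        return result["prediction"]
    return "NEEDS_REVIEW"


# Hypothetical model outputs
confident = triage({"prediction": "AI", "confidence": 0.97})
uncertain = triage({"prediction": "HUMAN", "confidence": 0.62})
print(confident, uncertain)
```

A suitable threshold depends on the cost of false positives in the application, so it is best chosen on held-out data rather than taken from this example.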

🛠️ Technical Specifications

Model Details

Architecture: BERT-based binary classifier
Parameters: ~110M total parameters
Model Size: ~440MB
Precision: FP16 optimized for inference
Inference Speed: ~50ms per text (GPU), ~200ms (CPU)

Input/Output Specification

# Input
{
    "text": "النص العربي المراد تصنيفه",
    "max_length": 512
}

# Output  
{
    "label": "HUMAN",  # or "AI"
    "score": 0.95,  # Probability of the predicted label
    "probabilities": {
        "HUMAN": 0.95,
        "AI": 0.05
    }
}
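By default the text-classification pipeline returns only the top label and its score. Calling it with `top_k=None` returns one `{"label", "score"}` entry per class instead. The sketch below shows one way to fold such a list into the output shape shown above. `to_output_spec` is a hypothetical helper, and the scores are made up.

```python
def to_output_spec(class_scores):
    """Build the output dict from per-class {"label", "score"} entries."""
    if not class_scores:
        raise ValueError("class_scores must not be empty")
    probabilities = {item["label"]: item["score"] for item in class_scores}
    best = max(class_scores, key=lambda item: item["score"])
    return {
        "label": best["label"],
        "score": best["score"],
        "probabilities": probabilities,
    }


# Hypothetical per-class scores, e.g. from classifier(text, top_k=None)
scores = [{"label": "HUMAN", "score": 0.95}, {"label": "AI", "score": 0.05}]
output = to_output_spec(scores)
print(output)
```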

🔬 Usage Examples

Example 1: News Article Detection

news_text = '''
أعلنت وزارة التعليم عن إطلاق برنامج جديد لتطوير المناهج الدراسية 
في المرحلة الثانوية، والذي يهدف إلى تعزيز مهارات الطلاب في التفكير 
النقدي والإبداع. ويأتي هذا البرنامج ضمن رؤية 2030 لتطوير التعليم.
'''

result = classifier(news_text)
# Expected: HUMAN (news articles are typically human-written)

Example 2: AI-Generated Text Detection

ai_text = '''
في هذا المقال، سنناقش موضوع التكنولوجيا. التكنولوجيا مهمة جداً في 
حياتنا. يجب أن نفهم التكنولوجيا بشكل صحيح. التكنولوجيا تساعدنا كثيراً.
'''

result = classifier(ai_text)
# Expected: AI (repetitive patterns typical of AI generation)

📚 Citation

If you use this model in your research or applications, please cite:

@misc{sabaridsnfuji-arabic-ai-detector-20250730,
  title={Arabic AI Text Detection Model},
  author={sabaridsnfuji},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/sabaridsnfuji/arabic-ai-text-detector}}
}

🤝 Contributing & Feedback

Model Issues: Please report issues in the discussions tab
Improvements: Suggestions for model improvements are welcome
Collaborations: Open to research collaborations in Arabic NLP

📄 License

This model is released under the Apache 2.0 license. You are free to:

✅ Use commercially
✅ Modify and distribute
✅ Use in research
✅ Include in applications

🙏 Acknowledgments

Base Model: aubmindlab/bert-base-arabertv2
Framework: Hugging Face Transformers
Infrastructure: Google Colab for training
Community: Arabic NLP research community

📞 Contact

🤗 Hugging Face: sabaridsnfuji
📧 Issues: Use the repository discussions for questions
🔗 Model Page: https://huggingface.co/sabaridsnfuji/arabic-ai-text-detector

🌟 If this model helps your work, please give it a ⭐ star! 🌟

Built with ❤️ for the Arabic NLP community

Last updated: 2025-07-30
