junaid1993 committed
Commit 9bcdf02 · verified · 1 Parent(s): 8fa0f23

Upload bot detection model - 2025-08-23 15:59
README.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - text-classification
+ - bot-detection
+ - social-media
+ - distilroberta
+ - pytorch
+ - transformers
+ datasets:
+ - custom
+ widget:
+ - text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
+   example_title: "Promotional Bot Text"
+ - text: "Just finished reading an interesting article about machine learning applications in healthcare."
+   example_title: "Human-like Text"
+ - text: "Follow for follow? Like my posts and I'll like yours back! 💯"
+   example_title: "Social Media Bot"
+ - text: "Had a wonderful dinner with my family tonight. These moments are precious."
+   example_title: "Authentic Human Text"
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ model-index:
+ - name: distilroberta-bot-detection
+   results:
+   - task:
+       type: text-classification
+       name: Bot Detection
+     metrics:
+     - type: accuracy
+       value: 0.9423
+       name: Test Accuracy
+     - type: f1
+       value: 0.9424
+       name: Test F1-Score (Weighted)
+     - type: precision
+       value: 0.9428
+       name: Test Precision (Weighted)
+     - type: recall
+       value: 0.9423
+       name: Test Recall (Weighted)
+ ---
+
+ # Bot Detection Model - DistilRoBERTa
+
+ ## Model Description
+
+ This model is a fine-tuned DistilRoBERTa-base model for binary classification of social media text, distinguishing human-authored from bot-generated content. Training uses class weighting to handle dataset imbalance, and results were validated with 5-fold cross-validation.
+
+ ## Performance
+
+ ### Cross-Validation Results (5-Fold)
+ | Metric | Mean ± Std | Range |
+ |--------|------------|-------|
+ | **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
+ | **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
+ | **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |
+ | **Recall (Weighted)** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
+
+ ### Test Set Performance
+ - **Accuracy**: 0.9423
+ - **F1-Score (Weighted)**: 0.9424
+ - **Precision (Weighted)**: 0.9428
+ - **Recall (Weighted)**: 0.9423
+ - **Inference Speed**: 232.83 samples/second
+
+ ## Usage
+
+ ### Quick Start
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import re
+
+ # Load model and tokenizer
+ model_name = "junaid1993/distilroberta-bot-detection"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ def preprocess_text(text):
+     """Clean text for bot detection"""
+     if not isinstance(text, str):
+         return ""
+
+     # Remove URLs
+     text = re.sub(r'http\S+|www\.\S+', '', text)
+     # Remove @ and # symbols
+     text = re.sub(r'[@#]', '', text)
+     # Remove punctuation and special characters
+     text = re.sub(r'[^\w\s]', '', text)
+     # Remove numbers
+     text = re.sub(r'\d+', '', text)
+     # Clean whitespace
+     text = re.sub(r'\s+', ' ', text).strip()
+
+     return text.lower()
+
+ def predict_bot(text, threshold=0.5):
+     """Predict if text is bot-generated"""
+     clean_text = preprocess_text(text)
+
+     if not clean_text:
+         # Nothing left after cleaning: no basis for a prediction
+         return {"prediction": "unknown", "bot_probability": 0.5, "human_probability": 0.5}
+
+     inputs = tokenizer(
+         clean_text,
+         return_tensors="pt",
+         truncation=True,
+         padding=True,
+         max_length=512
+     )
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+         probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+     bot_prob = probabilities[0][1].item()  # index 1 = bot (Human=0, Bot=1)
+     prediction = "bot" if bot_prob > threshold else "human"
+
+     return {
+         "prediction": prediction,
+         "bot_probability": round(bot_prob, 4),
+         "human_probability": round(probabilities[0][0].item(), 4)
+     }
+
+ # Example usage
+ text = "🔥 AMAZING DEAL! Click here now!"
+ result = predict_bot(text)
+ print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
+ ```
+
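+ ### Batch Inference
+
+ The throughput figure above (232.83 samples/second) was measured over the held-out test set. Below is a minimal batched-scoring sketch, assuming `tokenizer`, `model`, and `preprocess_text` from Quick Start are already in scope; `predict_bot_batch` and the batch size are illustrative choices, not part of the original release:
+
+ ```python
+ def predict_bot_batch(texts, batch_size=32, threshold=0.5):
+     """Score a list of texts, returning a bot probability for each."""
+     results = []
+     for i in range(0, len(texts), batch_size):
+         batch = [preprocess_text(t) for t in texts[i:i + batch_size]]
+         inputs = tokenizer(batch, return_tensors="pt", truncation=True,
+                            padding=True, max_length=512)
+         with torch.no_grad():
+             probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
+         for p in probs:
+             bot_prob = p[1].item()  # index 1 = bot
+             results.append({"prediction": "bot" if bot_prob > threshold else "human",
+                             "bot_probability": round(bot_prob, 4)})
+     return results
+ ```
+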
+ ## Training Details
+
+ ### Model Architecture
+ - **Base Model**: distilroberta-base
+ - **Task**: Binary sequence classification
+ - **Classes**: Human (0) vs Bot (1)
+ - **Parameters**: ~82M (see the check below)
+
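+ The parameter count can be verified directly from the checkpoint (a quick check, not part of the original card):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "junaid1993/distilroberta-bot-detection"
+ )
+ total = sum(p.numel() for p in model.parameters())
+ print(f"{total / 1e6:.1f}M parameters")  # ~82M for DistilRoBERTa-base + classification head
+ ```
+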
+ ### Training Configuration
+ - **Epochs**: 10 maximum, with early stopping (patience 3)
+ - **Batch Size**: 2 per device, gradient accumulation steps: 8 (effective batch size 16)
+ - **Learning Rate**: library default for the AdamW optimizer (not pinned in the training record)
+ - **Weight Decay**: 0.01
+ - **Mixed Precision**: FP16
+ - **Class Weighting**: Applied to handle dataset imbalance (see the sketch below)
+
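+ The training script itself is not part of this upload; the sketch below shows one common way to combine class weights with the `transformers` Trainer. `WeightedTrainer` and the weight values are illustrative assumptions, not the exact code or weights used here:
+
+ ```python
+ import torch
+ from transformers import Trainer
+
+ class WeightedTrainer(Trainer):
+     """Trainer variant that applies per-class weights in the cross-entropy loss."""
+     def __init__(self, class_weights, **kwargs):
+         super().__init__(**kwargs)
+         self.class_weights = class_weights
+
+     def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
+         labels = inputs.pop("labels")
+         outputs = model(**inputs)
+         loss_fct = torch.nn.CrossEntropyLoss(
+             weight=self.class_weights.to(outputs.logits.device)
+         )
+         # Two classes: human (0) and bot (1)
+         loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
+         return (loss, outputs) if return_outputs else loss
+
+ # Placeholder weights that up-weight the minority class
+ class_weights = torch.tensor([1.0, 1.3])
+ ```
+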
+ ### Data Preprocessing
+ 1. URL removal
+ 2. Special character cleaning (@ symbols, hashtags)
+ 3. Punctuation removal
+ 4. Number removal
+ 5. Whitespace normalization
+ 6. Lowercase conversion
+
+ These steps are implemented by `preprocess_text` in the Quick Start above.
+
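+ A worked example (illustrative input, assuming `preprocess_text` from Quick Start is in scope):
+
+ ```python
+ raw = "🔥 AMAZING DEAL! Get 90% OFF now! Click here: bit.ly/deal123 #sale @shop"
+ print(preprocess_text(raw))
+ # -> 'amazing deal get off now click here bitlydeal sale shop'
+ # Note: URL stripping only matches http/www links, so bare shorteners like
+ # bit.ly/... survive as plain tokens after punctuation and digits are removed.
+ ```
+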
+ ### Validation Methodology
+ - **Cross-Validation**: 5-fold Stratified K-Fold
+ - **Test Split**: 20% holdout set
+ - **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
+
+ The split scheme is sketched after this list.
+
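+ This sketch uses scikit-learn; `texts`, `labels`, and the random seed are illustrative assumptions, not values from the original run:
+
+ ```python
+ from sklearn.model_selection import StratifiedKFold, train_test_split
+
+ # 20% stratified holdout for final testing
+ X_dev, X_test, y_dev, y_test = train_test_split(
+     texts, labels, test_size=0.2, stratify=labels, random_state=42
+ )
+
+ # 5-fold stratified cross-validation on the remaining 80%
+ skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
+ for fold, (train_idx, val_idx) in enumerate(skf.split(X_dev, y_dev)):
+     print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
+ ```
+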
+ ## Limitations
+
+ - **Domain**: Primarily trained on social media text patterns
+ - **Language**: English text only
+ - **Temporal**: Bot patterns may evolve over time, requiring retraining
+ - **Context**: Performance may vary with text length and complexity
+
+ ## Intended Use
+
+ This model is designed for:
+ - Social media content moderation
+ - Academic research on bot detection
+ - Content analysis and verification
+
+ ## Ethical Considerations
+
+ - This model should be used responsibly and not for harassment
+ - Results should be interpreted with appropriate confidence thresholds
+ - Human oversight is recommended for critical decisions
+ - Regular model updates may be needed as bot techniques evolve
+
+ ## Citation
+
+ ```bibtex
+ @misc{distilroberta-bot-detection-2025,
+   title={Bot Detection Model using DistilRoBERTa},
+   author={Junaid},
+   year={2025},
+   publisher={Hugging Face},
+   url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
+ }
+ ```
+
+ ## License
+
+ MIT License
+
+ ---
+
+ **Model Card Created**: 2025-08-23
+ **Framework**: PyTorch + Transformers
+ **Validation**: 5-Fold Cross-Validation with Class Weighting
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
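Note that this config defines no `id2label` mapping, so generic loaders report the classes as `LABEL_0`/`LABEL_1`. A small sketch mapping those raw labels back to the names used in the model card (human = 0, bot = 1); note that the `pipeline` helper does not apply the card's `preprocess_text` cleaning:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="junaid1993/distilroberta-bot-detection")
name = {"LABEL_0": "human", "LABEL_1": "bot"}
result = clf("Follow for follow? Like my posts and I'll like yours back!")[0]
print(name[result["label"]], round(result["score"], 4))
```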
inference_example.py ADDED
@@ -0,0 +1,43 @@
+ # Simple Inference Example
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import re
+
+ # Load model
+ tokenizer = AutoTokenizer.from_pretrained("junaid1993/distilroberta-bot-detection")
+ model = AutoModelForSequenceClassification.from_pretrained("junaid1993/distilroberta-bot-detection")
+
+ def preprocess_text(text):
+     """Apply the same cleaning used at training time."""
+     if not isinstance(text, str):
+         return ""
+     text = re.sub(r'http\S+|www\.\S+', '', text)   # strip URLs
+     text = re.sub(r'[@#]', '', text)               # strip @ and # symbols
+     text = re.sub(r'[^\w\s]', '', text)            # strip punctuation/emoji
+     text = re.sub(r'\d+', '', text)                # strip numbers
+     text = re.sub(r'\s+', ' ', text).strip()       # normalize whitespace
+     return text.lower()
+
+ def predict_bot(text):
+     clean_text = preprocess_text(text)
+     inputs = tokenizer(clean_text, return_tensors="pt", truncation=True, max_length=512)
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+         probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+     bot_prob = probabilities[0][1].item()  # index 1 = bot
+     prediction = "Bot" if bot_prob > 0.5 else "Human"
+
+     return {"prediction": prediction, "bot_probability": bot_prob}
+
+ # Example usage
+ examples = [
+     "🔥 AMAZING DEAL! Get 90% OFF now!",
+     "Just finished reading a great book about AI."
+ ]
+
+ for text in examples:
+     result = predict_bot(text)
+     print(f"Text: {text}")
+     print(f"Prediction: {result['prediction']} ({result['bot_probability']:.3f})")
+     print("-" * 50)
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb6005d01fca73198876b7048d1d6cff380011e6a72779ce4285856951e1fa05
+ size 328492280
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f15cbd61cefd39e9a728c08ffcb3a729d0182a60e8d96281339f9800bbedc8e0
+ size 5368
training_info.json ADDED
@@ -0,0 +1,75 @@
+ {
+   "model_info": {
+     "model_type": "distilroberta-base",
+     "task": "binary_classification",
+     "classes": [
+       "human",
+       "bot"
+     ],
+     "num_parameters": "82M",
+     "framework": "transformers",
+     "pytorch_version": ">=1.12.0"
+   },
+   "training_methodology": {
+     "method": "class_weighted_cross_validation",
+     "cv_folds": 5,
+     "cv_strategy": "stratified",
+     "early_stopping": true,
+     "early_stopping_patience": 3,
+     "mixed_precision": "fp16"
+   },
+   "hyperparameters": {
+     "batch_size_per_device": 2,
+     "gradient_accumulation_steps": 8,
+     "max_epochs": 10,
+     "weight_decay": 0.01,
+     "optimizer": "AdamW"
+   },
+   "performance_summary": {
+     "cv_metrics": {
+       "accuracy": {
+         "mean": 0.9433,
+         "std": 0.0052,
+         "min": 0.9385,
+         "max": 0.9497
+       },
+       "f1_weighted": {
+         "mean": 0.9434,
+         "std": 0.0051,
+         "min": 0.9387,
+         "max": 0.9497
+       },
+       "f1_macro": {
+         "mean": 0.9419,
+         "std": 0.0052,
+         "min": 0.9371,
+         "max": 0.9483
+       },
+       "precision_weighted": {
+         "mean": 0.9444,
+         "std": 0.0045,
+         "min": 0.9397,
+         "max": 0.9498
+       },
+       "recall_weighted": {
+         "mean": 0.9433,
+         "std": 0.0052,
+         "min": 0.9385,
+         "max": 0.9497
+       }
+     },
+     "test_metrics": {
+       "loss": 0.1511,
+       "accuracy": 0.9423,
+       "precision_weighted": 0.9428,
+       "recall_weighted": 0.9423,
+       "f1_weighted": 0.9424,
+       "precision_macro": 0.9393,
+       "recall_macro": 0.9427,
+       "f1_macro": 0.9409,
+       "runtime": 121.6927,
+       "samples_per_second": 232.832,
+       "steps_per_second": 8.316
+     }
+   }
+ }
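This file records the training setup and evaluation summary in machine-readable form. It can be fetched programmatically; a minimal sketch using the `huggingface_hub` client:

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="junaid1993/distilroberta-bot-detection",
    filename="training_info.json",
)
with open(path) as f:
    info = json.load(f)

print(info["performance_summary"]["test_metrics"]["f1_weighted"])  # 0.9424
```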
vocab.json ADDED
The diff for this file is too large to render. See raw diff