junaid1993 committed
Commit 9bcdf02 · verified · 1 Parent(s): 8fa0f23

Upload bot detection model - 2025-08-23 15:59
README.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - text-classification
+ - bot-detection
+ - social-media
+ - distilroberta
+ - pytorch
+ - transformers
+ datasets:
+ - custom
+ widget:
+ - text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
+   example_title: "Promotional Bot Text"
+ - text: "Just finished reading an interesting article about machine learning applications in healthcare."
+   example_title: "Human-like Text"
+ - text: "Follow for follow? Like my posts and I'll like yours back! 💯"
+   example_title: "Social Media Bot"
+ - text: "Had a wonderful dinner with my family tonight. These moments are precious."
+   example_title: "Authentic Human Text"
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ model-index:
+ - name: distilroberta-bot-detection
+   results:
+   - task:
+       type: text-classification
+       name: Bot Detection
+     metrics:
+     - type: accuracy
+       value: 0.9423
+       name: Test Accuracy
+     - type: f1
+       value: 0.9424
+       name: Test F1-Score (Weighted)
+     - type: precision
+       value: 0.9428
+       name: Test Precision (Weighted)
+     - type: recall
+       value: 0.9423
+       name: Test Recall (Weighted)
+ ---
+
+ # Bot Detection Model - DistilRoBERTa
+
+ ## Model Description
+
+ This model is a fine-tuned DistilRoBERTa-base model for binary classification of social media text, distinguishing human-authored from bot-generated content. Training uses class weighting to handle dataset imbalance, and results were validated with 5-fold cross-validation.
+
+ ## Performance
+
+ ### Cross-Validation Results (5-Fold)
+ | Metric | Mean ± Std | Range |
+ |--------|------------|-------|
+ | **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
+ | **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
+ | **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |
+ | **Recall (Weighted)** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
+
+ ### Test Set Performance
+ - **Accuracy**: 0.9423
+ - **F1-Score (Weighted)**: 0.9424
+ - **Precision (Weighted)**: 0.9428
+ - **Recall (Weighted)**: 0.9423
+ - **Inference Speed**: 232.83 samples/second
+
+ ## Usage
+
+ ### Quick Start
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import re
+
+ # Load model and tokenizer
+ model_name = "junaid1993/distilroberta-bot-detection"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ def preprocess_text(text):
+     """Clean text for bot detection"""
+     if not isinstance(text, str):
+         return ""
+
+     # Remove URLs
+     text = re.sub(r'http\S+|www\.\S+', '', text)
+     # Remove @ and # symbols
+     text = re.sub(r'[@#]', '', text)
+     # Remove punctuation and special characters
+     text = re.sub(r'[^\w\s]', '', text)
+     # Remove numbers
+     text = re.sub(r'\d+', '', text)
+     # Clean whitespace
+     text = re.sub(r'\s+', ' ', text).strip()
+
+     return text.lower()
+
+ def predict_bot(text, threshold=0.5):
+     """Predict if text is bot-generated"""
+     clean_text = preprocess_text(text)
+
+     if not clean_text:
+         # Nothing left after cleaning: no basis for a prediction
+         return {"prediction": "unknown", "bot_probability": 0.5, "human_probability": 0.5}
+
+     inputs = tokenizer(
+         clean_text,
+         return_tensors="pt",
+         truncation=True,
+         padding=True,
+         max_length=512
+     )
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+         probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+     bot_prob = probabilities[0][1].item()  # index 1 = bot (Human=0, Bot=1)
+     prediction = "bot" if bot_prob > threshold else "human"
+
+     return {
+         "prediction": prediction,
+         "bot_probability": round(bot_prob, 4),
+         "human_probability": round(probabilities[0][0].item(), 4)
+     }
+
+ # Example usage
+ text = "🔥 AMAZING DEAL! Click here now!"
+ result = predict_bot(text)
+ print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
+ ```
+
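+ ### Batch Inference
+
+ The throughput figure above (232.83 samples/second) was measured over the held-out test set. Below is a minimal batched-scoring sketch, assuming `tokenizer`, `model`, and `preprocess_text` from Quick Start are already in scope; `predict_bot_batch` and the batch size are illustrative choices, not part of the original release:
+
+ ```python
+ def predict_bot_batch(texts, batch_size=32, threshold=0.5):
+     """Score a list of texts, returning a bot probability for each."""
+     results = []
+     for i in range(0, len(texts), batch_size):
+         batch = [preprocess_text(t) for t in texts[i:i + batch_size]]
+         inputs = tokenizer(batch, return_tensors="pt", truncation=True,
+                            padding=True, max_length=512)
+         with torch.no_grad():
+             probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
+         for p in probs:
+             bot_prob = p[1].item()  # index 1 = bot
+             results.append({"prediction": "bot" if bot_prob > threshold else "human",
+                             "bot_probability": round(bot_prob, 4)})
+     return results
+ ```
+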
+ ## Training Details
+
+ ### Model Architecture
+ - **Base Model**: distilroberta-base
+ - **Task**: Binary sequence classification
+ - **Classes**: Human (0) vs Bot (1)
+ - **Parameters**: ~82M (see the check below)
+
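+ The parameter count can be verified directly from the checkpoint (a quick check, not part of the original card):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "junaid1993/distilroberta-bot-detection"
+ )
+ total = sum(p.numel() for p in model.parameters())
+ print(f"{total / 1e6:.1f}M parameters")  # ~82M for DistilRoBERTa-base + classification head
+ ```
+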
+ ### Training Configuration
+ - **Epochs**: 10 maximum, with early stopping (patience 3)
+ - **Batch Size**: 2 per device, gradient accumulation steps: 8 (effective batch size 16)
+ - **Learning Rate**: library default for the AdamW optimizer (not pinned in the training record)
+ - **Weight Decay**: 0.01
+ - **Mixed Precision**: FP16
+ - **Class Weighting**: Applied to handle dataset imbalance (see the sketch below)
+
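+ The training script itself is not part of this upload; the sketch below shows one common way to combine class weights with the `transformers` Trainer. `WeightedTrainer` and the weight values are illustrative assumptions, not the exact code or weights used here:
+
+ ```python
+ import torch
+ from transformers import Trainer
+
+ class WeightedTrainer(Trainer):
+     """Trainer variant that applies per-class weights in the cross-entropy loss."""
+     def __init__(self, class_weights, **kwargs):
+         super().__init__(**kwargs)
+         self.class_weights = class_weights
+
+     def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
+         labels = inputs.pop("labels")
+         outputs = model(**inputs)
+         loss_fct = torch.nn.CrossEntropyLoss(
+             weight=self.class_weights.to(outputs.logits.device)
+         )
+         # Two classes: human (0) and bot (1)
+         loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
+         return (loss, outputs) if return_outputs else loss
+
+ # Placeholder weights that up-weight the minority class
+ class_weights = torch.tensor([1.0, 1.3])
+ ```
+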
+ ### Data Preprocessing
+ 1. URL removal
+ 2. Special character cleaning (@ symbols, hashtags)
+ 3. Punctuation removal
+ 4. Number removal
+ 5. Whitespace normalization
+ 6. Lowercase conversion
+
+ These steps are implemented by `preprocess_text` in the Quick Start above.
+
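+ A worked example (illustrative input, assuming `preprocess_text` from Quick Start is in scope):
+
+ ```python
+ raw = "🔥 AMAZING DEAL! Get 90% OFF now! Click here: bit.ly/deal123 #sale @shop"
+ print(preprocess_text(raw))
+ # -> 'amazing deal get off now click here bitlydeal sale shop'
+ # Note: URL stripping only matches http/www links, so bare shorteners like
+ # bit.ly/... survive as plain tokens after punctuation and digits are removed.
+ ```
+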
+ ### Validation Methodology
+ - **Cross-Validation**: 5-fold Stratified K-Fold
+ - **Test Split**: 20% holdout set
+ - **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
+
+ The split scheme is sketched after this list.
+
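+ This sketch uses scikit-learn; `texts`, `labels`, and the random seed are illustrative assumptions, not values from the original run:
+
+ ```python
+ from sklearn.model_selection import StratifiedKFold, train_test_split
+
+ # 20% stratified holdout for final testing
+ X_dev, X_test, y_dev, y_test = train_test_split(
+     texts, labels, test_size=0.2, stratify=labels, random_state=42
+ )
+
+ # 5-fold stratified cross-validation on the remaining 80%
+ skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
+ for fold, (train_idx, val_idx) in enumerate(skf.split(X_dev, y_dev)):
+     print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
+ ```
+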
+ ## Limitations
+
+ - **Domain**: Primarily trained on social media text patterns
+ - **Language**: English text only
+ - **Temporal**: Bot patterns may evolve over time, requiring retraining
+ - **Context**: Performance may vary with text length and complexity
+
+ ## Intended Use
+
+ This model is designed for:
+ - Social media content moderation
+ - Academic research on bot detection
+ - Content analysis and verification
+
+ ## Ethical Considerations
+
+ - This model should be used responsibly and not for harassment
+ - Results should be interpreted with appropriate confidence thresholds
+ - Human oversight is recommended for critical decisions
+ - Regular model updates may be needed as bot techniques evolve
+
+ ## Citation
+
+ ```bibtex
+ @misc{distilroberta-bot-detection-2025,
+   title={Bot Detection Model using DistilRoBERTa},
+   author={Junaid},
+   year={2025},
+   publisher={Hugging Face},
+   url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
+ }
+ ```
+
+ ## License
+
+ MIT License
+
+ ---
+
+ **Model Card Created**: 2025-08-23
+ **Framework**: PyTorch + Transformers
+ **Validation**: 5-Fold Cross-Validation with Class Weighting
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
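Note that this config defines no `id2label` mapping, so generic loaders report the classes as `LABEL_0`/`LABEL_1`. A small sketch mapping those raw labels back to the names used in the model card (human = 0, bot = 1); note that the `pipeline` helper does not apply the card's `preprocess_text` cleaning:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="junaid1993/distilroberta-bot-detection")
name = {"LABEL_0": "human", "LABEL_1": "bot"}
result = clf("Follow for follow? Like my posts and I'll like yours back!")[0]
print(name[result["label"]], round(result["score"], 4))
```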
inference_example.py ADDED
@@ -0,0 +1,43 @@
+ # Simple Inference Example
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import re
+
+ # Load model
+ tokenizer = AutoTokenizer.from_pretrained("junaid1993/distilroberta-bot-detection")
+ model = AutoModelForSequenceClassification.from_pretrained("junaid1993/distilroberta-bot-detection")
+
+ def preprocess_text(text):
+     """Apply the same cleaning used at training time."""
+     if not isinstance(text, str):
+         return ""
+     text = re.sub(r'http\S+|www\.\S+', '', text)   # strip URLs
+     text = re.sub(r'[@#]', '', text)               # strip @ and # symbols
+     text = re.sub(r'[^\w\s]', '', text)            # strip punctuation/emoji
+     text = re.sub(r'\d+', '', text)                # strip numbers
+     text = re.sub(r'\s+', ' ', text).strip()       # normalize whitespace
+     return text.lower()
+
+ def predict_bot(text):
+     clean_text = preprocess_text(text)
+     inputs = tokenizer(clean_text, return_tensors="pt", truncation=True, max_length=512)
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+         probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+     bot_prob = probabilities[0][1].item()  # index 1 = bot
+     prediction = "Bot" if bot_prob > 0.5 else "Human"
+
+     return {"prediction": prediction, "bot_probability": bot_prob}
+
+ # Example usage
+ examples = [
+     "🔥 AMAZING DEAL! Get 90% OFF now!",
+     "Just finished reading a great book about AI."
+ ]
+
+ for text in examples:
+     result = predict_bot(text)
+     print(f"Text: {text}")
+     print(f"Prediction: {result['prediction']} ({result['bot_probability']:.3f})")
+     print("-" * 50)
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb6005d01fca73198876b7048d1d6cff380011e6a72779ce4285856951e1fa05
+ size 328492280
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f15cbd61cefd39e9a728c08ffcb3a729d0182a60e8d96281339f9800bbedc8e0
+ size 5368
training_info.json ADDED
@@ -0,0 +1,75 @@
+ {
+   "model_info": {
+     "model_type": "distilroberta-base",
+     "task": "binary_classification",
+     "classes": [
+       "human",
+       "bot"
+     ],
+     "num_parameters": "82M",
+     "framework": "transformers",
+     "pytorch_version": ">=1.12.0"
+   },
+   "training_methodology": {
+     "method": "class_weighted_cross_validation",
+     "cv_folds": 5,
+     "cv_strategy": "stratified",
+     "early_stopping": true,
+     "early_stopping_patience": 3,
+     "mixed_precision": "fp16"
+   },
+   "hyperparameters": {
+     "batch_size_per_device": 2,
+     "gradient_accumulation_steps": 8,
+     "max_epochs": 10,
+     "weight_decay": 0.01,
+     "optimizer": "AdamW"
+   },
+   "performance_summary": {
+     "cv_metrics": {
+       "accuracy": {
+         "mean": 0.9433,
+         "std": 0.0052,
+         "min": 0.9385,
+         "max": 0.9497
+       },
+       "f1_weighted": {
+         "mean": 0.9434,
+         "std": 0.0051,
+         "min": 0.9387,
+         "max": 0.9497
+       },
+       "f1_macro": {
+         "mean": 0.9419,
+         "std": 0.0052,
+         "min": 0.9371,
+         "max": 0.9483
+       },
+       "precision_weighted": {
+         "mean": 0.9444,
+         "std": 0.0045,
+         "min": 0.9397,
+         "max": 0.9498
+       },
+       "recall_weighted": {
+         "mean": 0.9433,
+         "std": 0.0052,
+         "min": 0.9385,
+         "max": 0.9497
+       }
+     },
+     "test_metrics": {
+       "loss": 0.1511,
+       "accuracy": 0.9423,
+       "precision_weighted": 0.9428,
+       "recall_weighted": 0.9423,
+       "f1_weighted": 0.9424,
+       "precision_macro": 0.9393,
+       "recall_macro": 0.9427,
+       "f1_macro": 0.9409,
+       "runtime": 121.6927,
+       "samples_per_second": 232.832,
+       "steps_per_second": 8.316
+     }
+   }
+ }
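This file records the training setup and evaluation summary in machine-readable form. It can be fetched programmatically; a minimal sketch using the `huggingface_hub` client:

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="junaid1993/distilroberta-bot-detection",
    filename="training_info.json",
)
with open(path) as f:
    info = json.load(f)

print(info["performance_summary"]["test_metrics"]["f1_weighted"])  # 0.9424
```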
vocab.json ADDED
The diff for this file is too large to render. See raw diff