---
language: en
license: mit
tags:
- dga-detection
- cybersecurity
- domain-generation-algorithm
- wordlist-dga
- bert
- mixture-of-experts
- security
datasets:
- wordlist-dga-train-160k
metrics:
- f1
- precision
- recall
library_name: transformers
pipeline_tag: text-classification
---

# Expert Models for Wordlist-Based DGA Detection

This repository contains the **complete collection of models, datasets, and evaluation notebooks** from the research paper:

**"Expert Selection for Wordlist-Based DGA Detection"** (Currently Under Review)

*Reynier Leyva La O, Carlos A. Catania, and Rodrigo Gonzalez*

---

## 🎯 Overview

This work presents a systematic evaluation of **seven expert model candidates** for detecting wordlist-based Domain Generation Algorithms (DGAs). Through a rigorous two-phase evaluation, **ModernBERT** was identified as the optimal expert model, achieving:

- **86.7% F1-score** on known DGA families
- **80.9% F1-score** on previously unseen families
- **26ms inference time** on an NVIDIA Tesla T4 GPU (~38 domains/second)
- **9.4% relative F1 improvement** over generalist approaches on known families
- **30.2% relative F1 improvement** over generalist approaches on unknown families

---

## 📦 Repository Contents

```
moe-wordlist-dga-models/
│
├── models/
│   ├── modernbert-wordlist-expert/   ⭐ Optimal model (8 wordlist families)
│   ├── modernbert-generalist-54f/    📊 Generalist baseline (54 families)
│   ├── dombert-url/                  🔬 Domain-URL BERT
│   ├── gemma-3-4b-lora/              🤖 Gemma 3 4B LoRA adapters
│   ├── llama-3.2-3b-lora/            🤖 LLaMA 3.2 3B LoRA adapters
│   ├── cnn-wordlist/                 ⚡ Character-level CNN
│   ├── fanci/                        🔧 FANCI Random Forest
│   └── labin/                        🔧 LA Bin07 hybrid
│
├── datasets/
│   ├── train_wl.csv                  📊 Training set (160K domains)
│   └── test_sets/                    🧪 Test sets (in-family + generalization)
│
├── notebooks/                        📓 All training & evaluation notebooks
│
└── scripts/                          🐍 Inference & evaluation scripts
```

---

## 🚀 Quick Start

### Option 1: Use the Optimal Model (Recommended)

The **ModernBERT wordlist expert** is the best-performing model and the easiest to use:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Reynier/moe-wordlist-dga-models"
subfolder = "models/modernbert-wordlist-expert"

tokenizer = AutoTokenizer.from_pretrained(model_name, subfolder=subfolder)
model = AutoModelForSequenceClassification.from_pretrained(model_name, subfolder=subfolder)
model.eval()

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {prediction[0][0]:.4f}")
print(f"DGA: {prediction[0][1]:.4f}")
```
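
Several domains can be scored in one call with the same objects (a small sketch continuing the snippet above; the label order, index 0 = benign and index 1 = DGA, follows the single-domain example):

```python
# Batched scoring with the tokenizer and model loaded above
domains = ["secure-banking-portal.com", "google.com", "random-check-system.net"]
inputs = tokenizer(domains, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=1)
for d, p in zip(domains, probs):
    print(f"{d}: P(DGA) = {p[1]:.4f}")
```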

### Option 2: Download Specific Models

```python
from huggingface_hub import snapshot_download

# Download only ModernBERT expert
snapshot_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    allow_patterns="models/modernbert-wordlist-expert/*",
    local_dir="./models"
)

# Download all models
snapshot_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    local_dir="./complete-repo"
)
```

### Option 3: Clone Entire Repository

```bash
git lfs install
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models
```

---

## 📊 Model Performance Comparison

### Known Families (n=8)

| Model | Precision | Recall | F1-Score | FPR | Inference Time |
|-------|-----------|--------|----------|-----|----------------|
| **ModernBERT** ⭐ | 89.7 ± 4.1% | **86.6 ± 3.1%** | **86.7 ± 3.0%** | 9.0 ± 3.8% | **26ms** |
| LA_Bin07 | 84.6 ± 5.9% | 82.3 ± 3.3% | 81.7 ± 3.8% | 12.0 ± 5.9% | 80ms |
| CNN | 80.9 ± 5.7% | 80.0 ± 4.1% | 78.9 ± 4.0% | 15.3 ± 5.5% | <1ms |
| Gemma 3 4B | **95.4 ± 3.6%** | 66.5 ± 5.7% | 75.2 ± 4.8% | **2.5 ± 2.2%** | 1413ms |
| DomBertUrl | 81.2 ± 6.4% | 69.0 ± 6.9% | 72.4 ± 5.8% | 12.8 ± 5.0% | 13ms |
| FANCI | 70.3 ± 4.8% | 72.7 ± 5.4% | 70.5 ± 4.9% | 27.6 ± 5.5% | 310ms |
| LLaMA 3.2 3B | 92.4 ± 5.5% | 41.9 ± 8.8% | 54.7 ± 8.8% | 2.9 ± 2.2% | 656ms |

### Unknown Families (n=3) - Generalization Test

| Model | Precision | Recall | F1-Score | FPR | Inference Time |
|-------|-----------|--------|----------|-----|----------------|
| **DomBertUrl** | 87.7 ± 4.2% | **82.3 ± 4.5%** | **84.6 ± 3.5%** | 11.5 ± 4.3% | **13ms** |
| **ModernBERT** ⭐ | 89.0 ± 4.4% | 75.5 ± 5.6% | 80.9 ± 4.5% | 9.1 ± 4.1% | 35ms |
| Gemma 3 4B | **95.7 ± 4.4%** | 60.3 ± 5.9% | 70.8 ± 5.0% | **2.2 ± 2.1%** | 1390ms |
| CNN | 76.9 ± 6.9% | 60.2 ± 4.9% | 65.5 ± 5.3% | 15.9 ± 5.4% | <1ms |
| LLaMA 3.2 3B | 60.5 ± 4.4% | 68.8 ± 4.9% | 63.4 ± 4.2% | 39.8 ± 5.8% | 693ms |
| LA_Bin07 | 73.0 ± 9.1% | 45.7 ± 5.3% | 53.7 ± 5.7% | 14.1 ± 5.6% | 80ms |
| FANCI | 51.8 ± 7.6% | 32.0 ± 6.5% | 39.1 ± 6.7% | 27.6 ± 5.5% | 284ms |

> **Note:** Metrics reported as mean ± standard deviation across 30 randomized batches per family. Inference times measured on NVIDIA Tesla T4 GPU.
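
The snippet below is a minimal sketch of one way to reproduce this reporting convention, assuming per-domain labels and predictions are already available; the exact batch construction (batch size and benign/DGA mixing) follows the evaluation notebooks, not this sketch.

```python
import numpy as np
from sklearn.metrics import f1_score

def batched_f1(y_true, y_pred, n_batches=30, seed=42):
    """Mean and std of F1 over randomized batches, as reported in the tables above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    idx = np.random.default_rng(seed).permutation(len(y_true))
    scores = [f1_score(y_true[b], y_pred[b]) for b in np.array_split(idx, n_batches)]
    return float(np.mean(scores)), float(np.std(scores))

# Example: mean_f1, std_f1 = batched_f1(labels, predictions)
```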

---

## 🔬 Specialist vs. Generalist Validation

Direct comparison of **specialist training** (8 wordlist families only) vs. **generalist training** (54 diverse families):

| Scenario | Specialist F1 | Generalist F1 | Relative Improvement |
|----------|---------------|---------------|----------------------|
| Known families | **86.7%** | 79.2% | **+9.4%** |
| Unknown families | **80.9%** | 62.1% | **+30.2%** |

The improvement column is the relative gain in F1, e.g. (86.7 − 79.2) / 79.2 ≈ +9.4% for known families and (80.9 − 62.1) / 62.1 ≈ +30.2% for unknown families. This demonstrates that **domain-specific expert training significantly outperforms broad exposure** to diverse DGA types.

---

## 💻 Using Individual Models

### ModernBERT Generalist (54 families)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/modernbert-generalist-54f"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/modernbert-generalist-54f"
)
```

### Gemma 3 4B with LoRA Adapters

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="auto"
)

# Load LoRA adapters from this repo
model = PeftModel.from_pretrained(
    base_model,
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/gemma-3-4b-lora"
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
```

### LLaMA 3.2 3B with LoRA Adapters

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto"
)

# Load LoRA adapters
model = PeftModel.from_pretrained(
    base_model,
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/llama-3.2-3b-lora"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
```
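
The two snippets above only load the LoRA models; a minimal generation sketch follows. The prompt wording here is illustrative only; the exact prompt and label format used during fine-tuning are defined in the training notebooks (`Train_Gemma3_4B_DGA_WordList.ipynb`, `Train_llama3B_DGA_WordList.ipynb`).

```python
import torch

# Illustrative prompt only; not necessarily the template used for fine-tuning
prompt = 'Is the domain "secure-banking-portal.com" algorithmically generated? Answer DGA or benign.'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Decode only the newly generated tokens
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```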

### FANCI Random Forest

```python
import pickle
from huggingface_hub import hf_hub_download

# Download the pickled model
model_path = hf_hub_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    filename="models/fanci/fanci_model.pkl"
)

# Load the model
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Use the feature extractor included in the repo
# (requires a local clone of this repository on the Python path)
from models.fanci.feature_extractor import extract_features

domain = "secure-banking-portal.com"
features = extract_features(domain)
prediction = model.predict([features])
```

---

## 📊 Dataset Information

### Training Dataset

- **Total samples:** 160,000 (balanced 50/50)
- **DGA samples:** 80,000 from 8 wordlist-based families
- **Benign samples:** 80,000 from Tranco top sites

**DGA Families (Training):**
- charbot (10,000 samples)
- deception (10,000 samples)
- gozi (10,000 samples)
- manuelita (10,000 samples)
- matsnu (10,000 samples)
- nymaim (10,000 samples)
- rovnix (10,000 samples)
- suppobox (10,000 samples)

**Generalization Test Families (Unknown):**
- bigviktor (1,500 samples)
- ngioweb (1,500 samples)
- pizd (1,500 samples)

### Dataset Format

```csv
domain,family,label
secure-banking-portal.com,suppobox,1
google.com,benign,0
random-check-system.net,matsnu,1
```
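
To inspect the training set directly, the CSV can be fetched and loaded with pandas (a small sketch; the path `datasets/train_wl.csv` and the column names follow the repository layout and format shown above):

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Fetch the training CSV from this repository
csv_path = hf_hub_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    filename="datasets/train_wl.csv"
)

df = pd.read_csv(csv_path)
print(df["label"].value_counts())   # expected: ~80,000 benign (0) and ~80,000 DGA (1)
print(df["family"].value_counts())  # 8 wordlist families plus benign
```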

---

## 📓 Reproducing Paper Results

All training and evaluation notebooks are included in the `notebooks/` directory:

1. **Clone this repository:**
   ```bash
   git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
   cd moe-wordlist-dga-models/notebooks
   ```

2. **Install dependencies:**
   ```bash
   pip install torch transformers scikit-learn pandas numpy matplotlib seaborn jupyter
   ```

3. **Run notebooks:**
   ```bash
   jupyter notebook ModernBERT_base_DGA_Word.ipynb
   ```

### Available Notebooks

- `ModernBERT_base_DGA_Word.ipynb` - Optimal expert training
- `ModernBERT_base_DGA_54F.ipynb` - Generalist baseline
- `Train_Gemma3_4B_DGA_WordList.ipynb` - Gemma LoRA training
- `Train_llama3B_DGA_WordList.ipynb` - LLaMA LoRA training
- `DomUrlBert.ipynb` - DomBertUrl training
- `CNN_Patron_WL.ipynb` - CNN training
- `FANCI.ipynb` - Random Forest baseline
- `Labin_wl.ipynb` - LA Bin07 hybrid

---

## 🔍 Inference Scripts

Ready-to-use Python scripts are available in `scripts/`:

```bash
# Classify single domain with optimal model
python scripts/classify_domain.py "secure-banking-portal.com"

# Batch classification from CSV
python scripts/batch_classify.py --input domains.csv --output results.csv

# Compare all models
python scripts/compare_all_models.py --domain "test-domain.com"
```

---

## 🎓 Citation

```bibtex
@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}
```

---

## 📄 License

MIT License - see the LICENSE file for details.

---

## 🤝 Contributing & Contact

For questions regarding model usage or experimental reproducibility:

- **Email:** [email protected]
- **GitHub:** https://github.com/reypapin/MoE-word-list-dga-detection
- **Issues:** Open an issue on GitHub for technical questions

---

## 🙏 Acknowledgments

- **Hardware:** NVIDIA Tesla T4 GPUs provided by Google Colab
- **Datasets:** DGArchive, 360 Netlab, UMUDga repositories, Tranco list
- **Base Models:** Answer.AI (ModernBERT), Google (Gemma), Meta (LLaMA)
- **Funding:** National Scientific and Technical Research Council (CONICET), Argentina