# Expert Models for Wordlist-Based DGA Detection
Systematic evaluation of seven expert models for detecting wordlist-based Domain Generation Algorithms (DGAs), identifying ModernBERT as the optimal expert achieving 86.7% F1-score on known families and 80.9% on unseen variants.
## Overview
This repository contains the complete implementation of expert model evaluation for wordlist-based DGA detection, as described in our research paper (currently under review). Wordlist-based DGAs generate linguistically coherent domains that evade traditional detection methods, making them particularly challenging for cybersecurity systems.
## Key Findings
| Model | Known F1 | Unknown F1 | Inference Time | Throughput |
|---|---|---|---|---|
| **ModernBERT** (optimal) | 86.7% | 80.9% | 26ms | 38k domains/s |
| Gemma 3 4B LoRA | 82.1% | 75.3% | 650ms | 1.5k/s |
| LLaMA 3.2 3B LoRA | 81.4% | 74.8% | 680ms | 1.4k/s |
| DomBertUrl | 81.2% | 84.6% | 28ms | 35k/s |
| CNN | 78.9% | 72.1% | 15ms | 66k/s |
| FANCI (RF) | 77.3% | 68.5% | <1ms | >100k/s |
| LABin | 75.6% | 70.2% | 18ms | 55k/s |
**Key improvement:** Specialist training yields a relative F1-score gain of +9.4% over the generalist approach on known families and +30.2% on unseen families.
## Quick Start

### Option 1: Use Pre-trained Model (Recommended)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the optimal expert model (ModernBERT fine-tuned on wordlist-based DGAs)
model_name = "Reynier/moe-wordlist-dga-models"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    subfolder="models/modernbert-wordlist-expert",
)
model.eval()

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {probs[0][0]:.4f}")
print(f"DGA:    {probs[0][1]:.4f}")
```
### Option 2: Clone and Explore
```bash
# Clone the repository
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models

# Explore the available models
ls models/

# Preview the training dataset
python3 << EOF
import csv

with open('datasets/train_wl.csv', 'r') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i >= 5:
            break
        print(row)
EOF
```
## Repository Contents

### Models (7 Expert Candidates)

All models are located in the `models/` directory:
**`modernbert-wordlist-expert/`** (OPTIMAL)
- Base: answerdotai/ModernBERT-base
- F1-score: 86.7% (known), 80.9% (unknown)
- Inference: 26ms on Tesla T4
- Size: 575 MB

**`modernbert-generalist-54f/`** (Baseline)
- Trained on 54 diverse DGA families
- F1-score: 79.2% (known), 62.1% (unknown)
- Demonstrates the specialist advantage

**`gemma-3-4b-lora/`**
- LoRA adapters for google/gemma-3-4b-it (see the loading sketch after this list)
- Exceptional precision (95.4%), lower recall (66.5%)
- Size: 95 MB (adapters only)

**`llama-3.2-3b-lora/`**
- LoRA adapters for meta-llama/Llama-3.2-3B
- Balanced performance, slow inference
- Size: 110 MB (adapters only)

**`dombert-url/`**
- Domain-specialized BERT variant
- Strong generalization (84.6% F1 on unknown families)
- Size: 1.4 MB (LoRA adapters)

**`cnn-wordlist/`**
- Convolutional neural network
- Fastest inference (15ms), moderate accuracy
- Size: 76 KB

**`fanci/`**
- Random Forest with engineered features
- Traditional ML baseline
- Size: 794 MB (includes dictionaries)

**`labin/`**
- Hybrid linguistic-attention model
- Keras implementation
- Size: 8.1 MB
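The LoRA checkpoints contain adapters only, so they must be attached to their base model at load time. Below is a minimal sketch with `peft`, using the LLaMA adapters as an example; the `subfolder` path is an assumption based on the directory layout above, and the prompting/label scheme at inference follows the corresponding notebooks.

```python
# Sketch: attaching the LoRA adapters to their base model with peft.
# The subfolder path is assumed from the repository layout above.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

model = PeftModel.from_pretrained(
    base_model,
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/llama-3.2-3b-lora",  # assumed adapter location
)
model.eval()
```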
## Datasets

All datasets are in the `datasets/` directory:
### Training Datasets

**`train_wl.csv`** (160,000 samples)
- Purpose: Train the expert models (specialist approach)
- DGA Families (8): charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox (10K each)
- Benign: 80,000 domains from Tranco top sites
- Distribution: Balanced (50% DGA, 50% benign)
- Format: `domain,family,label`

**`train_1M.csv`** (1,080,000 samples)
- Purpose: Train the generalist model (baseline comparison)
- DGA Families: 54 diverse families (wordlist-based + algorithmic)
- Distribution: Diverse multi-family dataset
### Test Datasets

**`test-known/`** (8 families)
- Purpose: Evaluate performance on the training families
- Total samples: 723,847
- Families:
  - charbot: 11,001 samples
  - deception: 30,001 samples
  - gozi: 50,212 samples
  - manuelita: 20,001 samples
  - matsnu: 116,480 samples
  - nymaim: 217,773 samples
  - rovnix: 120,351 samples
  - suppobox: 158,028 samples
- Format: compressed `.gz` files (one per family)

**`test-generalization/`** (3 families)
- Purpose: Test generalization to unseen wordlist-based DGAs
- Total samples: 13,562
- Families:
  - bigviktor: 2,001 samples
  - ngioweb: 2,001 samples
  - pizd: 9,560 samples
- Format: compressed `.gz` files (see the reading sketch below)
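The per-family test sets are gzip-compressed text files; the following is a minimal sketch for previewing one of them. The file name `matsnu.csv.gz` and its column layout are assumptions; check the actual listing under `datasets/test-known/` first.

```python
# Sketch: previewing a compressed per-family test file.
# The file name and CSV layout are assumptions; inspect the directory first.
import csv
import gzip

with gzip.open("datasets/test-known/matsnu.csv.gz", mode="rt", encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f)):
        if i >= 5:
            break
        print(row)
```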
## Notebooks

Training and evaluation notebooks are in `notebooks/`:

- `ModernBERT_base_DGA_Word.ipynb` - Train the ModernBERT expert (8 families)
- `ModernBERT_base_DGA_54F.ipynb` - Train the ModernBERT generalist (54 families)
- `Train_Gemma3_4B_DGA_WordList.ipynb` - Fine-tune Gemma with LoRA
- `Train_llama3B_DGA_WordList.ipynb` - Fine-tune LLaMA with LoRA
- `Test_Gemma3_4B_DGA_Last.ipynb` - Evaluate the Gemma model
- `Test__llama3B_DGA.ipynb` - Evaluate the LLaMA model
- `DomUrlBert.ipynb` - Train the DomBertUrl model
- `CNN_Patron_WL.ipynb` - Train the CNN model
- `FANCI.ipynb` - Train the FANCI Random Forest
- `Labin_wl.ipynb` - Train the LABin model
## Research Methodology
### Two-Phase Evaluation Protocol

**Phase 1: Known Families Performance**
- Evaluate on the 8 training families
- 30 test batches × 100 domains per family
- Measures detection accuracy on familiar variants

**Phase 2: Generalization Capability**
- Evaluate on 3 unseen wordlist-based families
- Tests robustness against novel DGA variants
- Critical for real-world deployment
### Evaluation Metrics
- Precision: Accuracy of DGA predictions
- Recall: Coverage of actual DGAs
- F1-Score: Harmonic mean (primary metric)
- False Positive Rate (FPR): Benign misclassification rate
- Inference Time: Real-world performance (Tesla T4 GPU)
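A minimal sketch of how these metrics can be computed per evaluation batch and aggregated into the mean ± std figures reported later, using scikit-learn; the random labels below are placeholders for the real batch construction defined in the notebooks.

```python
# Sketch: per-batch precision/recall/F1/FPR, aggregated as mean ± std.
# The random labels below are placeholders, not the real evaluation data.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def batch_metrics(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, f1, fpr

# 30 batches of 100 labels/predictions each (placeholder data)
rng = np.random.default_rng(0)
batches = [(rng.integers(0, 2, 100), rng.integers(0, 2, 100)) for _ in range(30)]

f1_scores = [batch_metrics(t, p)[2] for t, p in batches]
print(f"F1: {np.mean(f1_scores):.3f} ± {np.std(f1_scores):.3f}")
```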
### DGA Families
Training Families (8 wordlist-based):
- charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox
Generalization Test (3 wordlist-based):
- bigviktor, ngioweb, pizd
## Installation & Requirements

### Basic Requirements

```bash
pip install torch transformers safetensors
```

### For LLM Models (Gemma/LLaMA)

```bash
pip install peft accelerate bitsandbytes
```

### For Traditional Models (FANCI)

```bash
pip install scikit-learn joblib
```

### For LABin Model

```bash
pip install tensorflow keras
```
### GPU Recommendations
- Optimal: NVIDIA Tesla T4 or better
- Minimum: 8GB VRAM for ModernBERT
- LLMs: 16GB+ VRAM (or use 8-bit quantization)
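For the 8-bit option mentioned above, here is a minimal sketch using `bitsandbytes` quantization at load time; the model choice and adapter path are carried over from the earlier LoRA example and remain assumptions.

```python
# Sketch: loading the LLaMA base model in 8-bit before attaching the LoRA adapters.
# Requires a CUDA GPU and bitsandbytes; settings are illustrative.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/llama-3.2-3b-lora",  # assumed adapter location
)
model.eval()
```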
## Reproducibility

All experiments are fully reproducible:

1. Download the datasets from the `datasets/` folder
2. Run the training notebooks from `notebooks/`
3. Load the pre-trained models from `models/`
4. Verify the reported metrics using the test sets
### Expected Results (± std)

| Model | Known F1 | Unknown F1 |
|---|---|---|
| ModernBERT | 86.7% ± 3.0% | 80.9% ± 4.5% |
| Generalist | 79.2% ± 3.5% | 62.1% ± 5.2% |
## Citation

This work is currently under review. Preliminary citation:

```bibtex
@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection: A Systematic Evaluation},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}
```
## Links
- GitHub Repository: MoE-word-list-dga-detection
- Paper: Under Review
- Contact: [email protected]
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Acknowledgments
- Datasets: DGArchive, 360 Netlab, UMUDga, Tranco
- Base Models: ModernBERT (Answer.AI), Gemma (Google), LLaMA (Meta)
- Infrastructure: CONICET Argentina
Last Updated: October 2025