🛡️ Expert Models for Wordlist-Based DGA Detection

Systematic evaluation of seven expert models for detecting wordlist-based Domain Generation Algorithms (DGAs), identifying ModernBERT as the optimal expert achieving 86.7% F1-score on known families and 80.9% on unseen variants.

📋 Overview

This repository contains the complete implementation of the expert model evaluation for wordlist-based DGA detection described in our research paper (currently under review). Wordlist-based DGAs generate linguistically coherent domains that evade traditional detection methods, which makes them particularly challenging for cybersecurity systems.

🎯 Key Findings

Model	Known F1	Unknown F1	Inference Time	Throughput (domains/s)
ModernBERT ⭐	86.7%	80.9%	26ms	38k
Gemma 3 4B LoRA	82.1%	75.3%	650ms	1.5k
LLaMA 3.2 3B LoRA	81.4%	74.8%	680ms	1.4k
DomBertUrl	81.2%	84.6%	28ms	35k
CNN	78.9%	72.1%	15ms	66k
FANCI (RF)	77.3%	68.5%	<1ms	>100k
LABin	75.6%	70.2%	18ms	55k

Key Improvement: Specialist training yields a relative F1-score improvement of +9.4% over the generalist ModernBERT baseline on known families (86.7% vs. 79.2%) and +30.2% on unseen families (80.9% vs. 62.1%).
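
The improvement figures appear to be relative gains computed from the specialist and generalist ModernBERT F1-scores listed in this README. A minimal check under that assumption (the function name `relative_gain` is illustrative, not part of the repository):

```python
def relative_gain(specialist_f1: float, generalist_f1: float) -> float:
    """Relative improvement of the specialist over the generalist, in percent."""
    return (specialist_f1 - generalist_f1) / generalist_f1 * 100.0


# F1-scores (in %) reported in this README for the two ModernBERT variants.
known = relative_gain(86.7, 79.2)    # known families
unknown = relative_gain(80.9, 62.1)  # unseen families

print(f"Known families:  +{known:.2f}% relative ({86.7 - 79.2:.1f} points absolute)")
print(f"Unseen families: +{unknown:.2f}% relative ({80.9 - 62.1:.1f} points absolute)")
```

The computed values are roughly 9.47% and 30.27%, which agree with the reported +9.4% and +30.2% when truncated to one decimal place. In absolute terms the gaps are 7.5 and 18.8 F1 points.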

🚀 Quick Start

Option 1: Use Pre-trained Model (Recommended)

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the optimal expert model from its subfolder in the Hub repository
model_name = "Reynier/moe-wordlist-dga-models"
subfolder = "models/modernbert-wordlist-expert"

# The tokenizer is loaded from the base model
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    subfolder=subfolder
)

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {probs[0][0]:.4f}")
print(f"DGA:    {probs[0][1]:.4f}")

Option 2: Clone and Explore

# Clone the repository
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models

# Explore available models
ls models/

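# Preview the first five rows of the training set
python3 << 'EOF'
import csv
with open('datasets/train_wl.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i >= 5:
            break  # stop early instead of scanning the whole file
        print(row)
EOF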

📦 Repository Contents

🤖 Models (7 Expert Candidates)

All models are located in the models/ directory (the seven expert candidates plus the generalist baseline used for comparison):

modernbert-wordlist-expert/ ⭐ (OPTIMAL)
- Base: answerdotai/ModernBERT-base
- F1-score: 86.7% (known), 80.9% (unknown)
- Inference: 26ms on Tesla T4
- Size: 575 MB
modernbert-generalist-54f/ (Baseline)
- Trained on 54 diverse DGA families
- F1-score: 79.2% (known), 62.1% (unknown)
- Demonstrates specialist advantage
gemma-3-4b-lora/
- LoRA adapters for google/gemma-3-4b-it
- Exceptional precision (95.4%), lower recall (66.5%)
- Size: 95 MB (adapters only)
llama-3.2-3b-lora/
- LoRA adapters for meta-llama/Llama-3.2-3B
- Balanced performance, slow inference
- Size: 110 MB (adapters only)
dombert-url/
- Domain-specialized BERT variant
- Strong generalization (84.6% on unknown)
- Size: 1.4 MB (LoRA adapters)
cnn-wordlist/
- Convolutional neural network
- Fastest neural-network inference (15ms; only the FANCI Random Forest is faster), moderate accuracy
- Size: 76 KB
fanci/
- Random Forest with engineered features
- Traditional ML baseline
- Size: 794 MB (includes dictionaries)
labin/
- Hybrid linguistic-attention model
- Keras implementation
- Size: 8.1 MB

📊 Datasets

All datasets are in the datasets/ directory:

Training Datasets

train_wl.csv (160,000 samples)
- Purpose: Train expert models (specialist approach)
- DGA Families (8): charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox (10K each)
- Benign: 80,000 domains from Tranco top sites
- Distribution: Balanced (50% DGA, 50% benign)
- Format: domain,family,label
train_1M.csv (1,080,000 samples)
- Purpose: Train generalist model (baseline comparison)
- DGA Families: 54 diverse families (wordlist + algorithmic)
- Distribution: Diverse multi-family dataset
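
To check the stated composition of train_wl.csv, the rows can be counted by label and by family. The sketch below assumes the file has a header row with the column names `domain`, `family`, and `label` (as the listed format suggests); the helper name `summarize_dataset` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_dataset(path):
    """Count rows per label and per family in a domain,family,label CSV."""
    labels, families = Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[row["label"]] += 1
            families[row["family"]] += 1
    return labels, families


# Assumed location inside a local clone of the repository.
train_path = Path("datasets/train_wl.csv")
if train_path.exists():
    labels, families = summarize_dataset(train_path)
    print("Rows per label: ", dict(labels))
    print("Rows per family:", dict(families))
```

If the description above holds, the label counts should split evenly and each of the eight DGA families should contribute about 10,000 rows.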

Test Datasets

test-known/ (8 families)
- Purpose: Evaluate performance on training families
- Total samples: 723,847
- Families:
  - charbot: 11,001 samples
  - deception: 30,001 samples
  - gozi: 50,212 samples
  - manuelita: 20,001 samples
  - matsnu: 116,480 samples
  - nymaim: 217,773 samples
  - rovnix: 120,351 samples
  - suppobox: 158,028 samples
- Format: Compressed .gz files (one per family)
test-generalization/ (3 families)
- Purpose: Test generalization to unseen wordlist-based DGAs
- Total samples: 13,562
- Families:
  - bigviktor: 2,001 samples
  - ngioweb: 2,001 samples
  - pizd: 9,560 samples
- Format: Compressed .gz files
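
The compressed test files can be read without unpacking them first. The internal layout of the .gz files is not documented here, so this sketch assumes plain text with one domain per line and that the test folders sit under datasets/; both are assumptions, and the helper names are illustrative.

```python
import gzip
from pathlib import Path


def iter_domains(gz_path):
    """Yield stripped, non-empty lines from a gzip-compressed text file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            domain = line.strip()
            if domain:
                yield domain


def count_per_family(directory):
    """Map each *.gz file name (without the .gz suffix) to its line count."""
    return {
        p.name.removesuffix(".gz"): sum(1 for _ in iter_domains(p))
        for p in sorted(Path(directory).glob("*.gz"))
    }


# Assumed location inside a local clone of the repository.
test_dir = Path("datasets/test-generalization")
if test_dir.is_dir():
    for family, n in count_per_family(test_dir).items():
        print(f"{family}: {n:,} lines")
```

Comparing the printed counts with the per-family sample counts listed above is a quick way to confirm that a download is complete. If the files turn out to be CSV with a header, `csv.DictReader` can wrap the handle returned by `gzip.open` instead.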

📓 Notebooks

Training and evaluation notebooks in notebooks/:

ModernBERT_base_DGA_Word.ipynb - Train ModernBERT expert (8 families)
ModernBERT_base_DGA_54F.ipynb - Train ModernBERT generalist (54 families)
Train_Gemma3_4B_DGA_WordList.ipynb - Fine-tune Gemma with LoRA
Train_llama3B_DGA_WordList.ipynb - Fine-tune LLaMA with LoRA
Test_Gemma3_4B_DGA_Last.ipynb - Evaluate Gemma model
Test__llama3B_DGA.ipynb - Evaluate LLaMA model
DomUrlBert.ipynb - Train DomBertUrl model
CNN_Patron_WL.ipynb - Train CNN model
FANCI.ipynb - Train FANCI Random Forest
Labin_wl.ipynb - Train LABin model

🔬 Research Methodology

Two-Phase Evaluation Protocol

Phase 1: Known Families Performance
- Evaluate on 8 training families
- 30 test batches × 100 domains per family
- Measures detection accuracy on familiar variants
Phase 2: Generalization Capability
- Evaluate on 3 unseen wordlist-based families
- Tests robustness against novel DGA variants
- Critical for real-world deployment
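
With 30 batches of 100 domains for each of the 8 known families, Phase 1 scores 8 × 30 × 100 = 24,000 domains per model. The sketch below shows one way such a batched evaluation could produce a mean and standard deviation of F1. How batches are drawn, and how benign and DGA domains are mixed within them, is not specified here, so random sampling from a labeled pool is an assumption, as is the hypothetical `predict` callable.

```python
import random
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """F1 for the positive (DGA = 1) class from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate_in_batches(samples, predict, n_batches=30, batch_size=100, seed=0):
    """Return (mean F1, std of F1) over randomly drawn batches.

    samples: list of (domain, label) pairs, label 1 = DGA, 0 = benign.
    predict: hypothetical callable mapping a list of domains to 0/1 labels.
    """
    rng = random.Random(seed)  # fixed seed so the batches are repeatable
    scores = []
    for _ in range(n_batches):
        batch = rng.sample(samples, batch_size)  # no repeats within a batch
        domains = [d for d, _ in batch]
        labels = [y for _, y in batch]
        scores.append(f1_score(labels, predict(domains)))
    return mean(scores), stdev(scores)
```

Running `evaluate_in_batches` once per family, with that family's DGA domains plus benign domains as `samples`, gives per-family means and spreads that can then be averaged across families.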

Evaluation Metrics

Precision: Fraction of domains flagged as DGA that are actually DGA
Recall: Fraction of actual DGA domains that are detected
F1-Score: Harmonic mean of precision and recall (primary metric)
False Positive Rate (FPR): Fraction of benign domains misclassified as DGA
Inference Time: Inference latency measured on a Tesla T4 GPU
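
The four classification metrics follow directly from confusion-matrix counts, with DGA as the positive class. A minimal sketch (the function name and the example counts are illustrative, not results from the paper):

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1 and FPR from confusion-matrix counts.

    tp: DGA domains flagged as DGA      fp: benign domains flagged as DGA
    tn: benign domains passed as benign fn: DGA domains passed as benign
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Made-up counts for a batch of 100 DGA and 100 benign domains.
for name, value in detection_metrics(tp=80, fp=10, tn=90, fn=20).items():
    print(f"{name}: {value:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a model with very high precision but modest recall still receives a moderate F1.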

DGA Families

Training Families (8 wordlist-based):

charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox

Generalization Test (3 wordlist-based):

bigviktor, ngioweb, pizd

🛠️ Installation & Requirements

Basic Requirements

pip install torch transformers safetensors

For LLM Models (Gemma/LLaMA)

pip install peft accelerate bitsandbytes

For Traditional Models (FANCI)

pip install scikit-learn joblib

For LABin Model

pip install tensorflow keras

GPU Recommendations

Optimal: NVIDIA Tesla T4 or better
Minimum: 8GB VRAM for ModernBERT
LLMs: 16GB+ VRAM (or use 8-bit quantization)
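
A back-of-the-envelope estimate shows why 8-bit quantization eases the LLM requirement. The sketch counts weight memory only and uses the nominal parameter counts implied by the model names (3B and 4B), which are approximations; activations, caches, and framework overhead add to these figures.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the model names (assumption).
nominal_params = {"LLaMA 3.2 3B": 3e9, "Gemma 3 4B": 4e9}

for name, n_params in nominal_params.items():
    fp16 = weight_memory_gb(n_params, 16)
    int8 = weight_memory_gb(n_params, 8)
    print(f"{name}: ~{fp16:.0f} GB in 16-bit, ~{int8:.0f} GB in 8-bit (weights only)")
```

Halving the bits per weight halves the weight footprint, which leaves more headroom for batching on a 16GB card.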

📊 Reproducibility

All experiments are fully reproducible:

1. Download the datasets from the datasets/ folder
2. Run the training notebooks in notebooks/
3. Load the pre-trained models from models/
4. Verify the reported metrics using the test sets

Expected Results (±std)

Model	Known F1	Unknown F1
ModernBERT	86.7% ± 3.0%	80.9% ± 4.5%
Generalist	79.2% ± 3.5%	62.1% ± 5.2%
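
When re-running the evaluation, a reproduced F1-score can be compared against these ranges. A small sketch, assuming a result counts as consistent if it falls within one standard deviation of the reported mean (the threshold and the helper name are illustrative choices, not part of the protocol):

```python
# Reported mean and standard deviation of F1 (in %), copied from the table above.
EXPECTED = {
    "ModernBERT": {"known": (86.7, 3.0), "unknown": (80.9, 4.5)},
    "Generalist": {"known": (79.2, 3.5), "unknown": (62.1, 5.2)},
}


def within_expected(model: str, split: str, measured_f1: float,
                    n_std: float = 1.0) -> bool:
    """True if measured_f1 lies within n_std standard deviations of the mean."""
    expected_mean, expected_std = EXPECTED[model][split]
    return abs(measured_f1 - expected_mean) <= n_std * expected_std


# Example with a made-up reproduced score for the expert on known families.
print(within_expected("ModernBERT", "known", 85.0))
```

A wider tolerance (for example `n_std=2.0`) may be more appropriate when the reproduction uses different hardware or random seeds.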

📖 Citation

This work is currently under review. Preliminary citation:

@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection: A Systematic Evaluation},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}

🔗 Links

GitHub Repository: MoE-word-list-dga-detection
Paper: Under Review
Contact: [email protected]

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

🙏 Acknowledgments

Datasets: DGArchive, 360 Netlab, UMUDga, Tranco
Base Models: ModernBERT (Answer.AI), Gemma (Google), LLaMA (Meta)
Infrastructure: CONICET Argentina

Last Updated: October 2025
