🛡️ Expert Models for Wordlist-Based DGA Detection


Systematic evaluation of seven expert models for detecting wordlist-based Domain Generation Algorithms (DGAs), identifying ModernBERT as the optimal expert achieving 86.7% F1-score on known families and 80.9% on unseen variants.


📋 Overview

This repository contains the complete implementation of expert model evaluation for wordlist-based DGA detection, as described in our research paper (currently under review). Wordlist-based DGAs generate linguistically coherent domains that evade traditional detection methods, making them particularly challenging for cybersecurity systems.

🎯 Key Findings

Model Known F1 Unknown F1 Inference Time Throughput
ModernBERT ⭐ 86.7% 80.9% 26ms 38k domains/s
Gemma 3 4B LoRA 82.1% 75.3% 650ms 1.5k/s
LLaMA 3.2 3B LoRA 81.4% 74.8% 680ms 1.4k/s
DomBertUrl 81.2% 84.6% 28ms 35k/s
CNN 78.9% 72.1% 15ms 66k/s
FANCI (RF) 77.3% 68.5% <1ms >100k/s
LABin 75.6% 70.2% 18ms 55k/s

Key Improvement: Compared with the generalist ModernBERT baseline, specialist training yields a relative F1-score improvement of +9.4% on known families (86.7% vs. 79.2%) and +30.2% on unseen families (80.9% vs. 62.1%).
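
The two gains quoted above match relative (not absolute) improvements over the generalist baseline. The quick check below uses the rounded F1-scores reported in this README and gives roughly 9.5% and 30.3%. The small gap to +9.4% and +30.2% presumably comes from rounding of the inputs; that is an assumption, since the unrounded scores are not listed here.

```python
def relative_gain(specialist_f1: float, generalist_f1: float) -> float:
    """Relative improvement of the specialist over the generalist, in percent."""
    return (specialist_f1 - generalist_f1) / generalist_f1 * 100.0

# Rounded F1-scores (%) for the ModernBERT specialist vs. the generalist baseline.
known = relative_gain(86.7, 79.2)
unknown = relative_gain(80.9, 62.1)

print(f"Known families:  +{known:.1f}% relative ({86.7 - 79.2:.1f} points absolute)")
print(f"Unseen families: +{unknown:.1f}% relative ({80.9 - 62.1:.1f} points absolute)")
```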


🚀 Quick Start

Option 1: Use Pre-trained Model (Recommended)

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the optimal expert model from its subfolder in the Hub repository
model_name = "Reynier/moe-wordlist-dga-models"
subfolder = "models/modernbert-wordlist-expert"

# The tokenizer comes from the base ModernBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    subfolder=subfolder
)

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {probs[0][0]:.4f}")
print(f"DGA:    {probs[0][1]:.4f}")

Option 2: Clone and Explore

# Clone the repository
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models

# Explore available models
ls models/

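# Preview the first five rows of the training set
python3 << 'EOF'
import csv
from itertools import islice

with open('datasets/train_wl.csv', newline='') as f:
    # Stop after five rows instead of scanning the whole file
    for row in islice(csv.DictReader(f), 5):
        print(row)
EOF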

📦 Repository Contents

🤖 Models (7 Expert Candidates + 1 Generalist Baseline)

All models are located in the models/ directory:

  1. modernbert-wordlist-expert/ ⭐ (OPTIMAL)

    • Base: answerdotai/ModernBERT-base
    • F1-score: 86.7% (known), 80.9% (unknown)
    • Inference: 26ms on Tesla T4
    • Size: 575 MB
  2. modernbert-generalist-54f/ (Baseline)

    • Trained on 54 diverse DGA families
    • F1-score: 79.2% (known), 62.1% (unknown)
    • Demonstrates specialist advantage
  3. gemma-3-4b-lora/

    • LoRA adapters for google/gemma-3-4b-it
    • Exceptional precision (95.4%), lower recall (66.5%)
    • Size: 95 MB (adapters only)
  4. llama-3.2-3b-lora/

    • LoRA adapters for meta-llama/Llama-3.2-3B
    • Balanced performance, slow inference
    • Size: 110 MB (adapters only)
  5. dombert-url/

    • Domain-specialized BERT variant
    • Strong generalization (84.6% on unknown)
    • Size: 1.4 MB (LoRA adapters)
  6. cnn-wordlist/

    • Convolutional neural network
    • Fastest neural model at inference (15ms), moderate accuracy
    • Size: 76 KB
  7. fanci/

    • Random Forest with engineered features
    • Traditional ML baseline
    • Size: 794 MB (includes dictionaries)
  8. labin/

    • Hybrid linguistic-attention model
    • Keras implementation
    • Size: 8.1 MB

📊 Datasets

All datasets are in the datasets/ directory:

Training Datasets

  1. train_wl.csv (160,000 samples)

    • Purpose: Train expert models (specialist approach)
    • DGA Families (8): charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox (10K each)
    • Benign: 80,000 domains from Tranco top sites
    • Distribution: Balanced (50% DGA, 50% benign)
    • Format: domain,family,label
  2. train_1M.csv (1,080,000 samples)

    • Purpose: Train generalist model (baseline comparison)
    • DGA Families: 54 diverse families (wordlist + algorithmic)
    • Distribution: Diverse multi-family dataset
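
As a sanity check on the composition described above, the following sketch tallies rows per label and per family for a file in the domain,family,label format. The summarize_dataset helper and the tiny in-memory sample are illustrative assumptions. The family name used for benign rows and the 0/1 label encoding are guesses, so the code counts whatever values it finds rather than relying on them.

```python
import csv
import io
from collections import Counter

def summarize_dataset(handle):
    """Count rows per label and per family in a domain,family,label CSV."""
    labels, families = Counter(), Counter()
    for row in csv.DictReader(handle):
        labels[row["label"]] += 1
        families[row["family"]] += 1
    return labels, families

# Hypothetical sample rows; the real file is datasets/train_wl.csv.
sample = io.StringIO(
    "domain,family,label\n"
    "example.com,benign,0\n"
    "wikipedia.org,benign,0\n"
    "placeholder-one.net,matsnu,1\n"
    "placeholder-two.com,suppobox,1\n"
)
labels, families = summarize_dataset(sample)
print(dict(labels))
print(dict(families))
```

For the real data, pass open('datasets/train_wl.csv', newline='') instead of the sample. Given the description above, train_wl.csv should then show 80,000 rows for each of the two label values and 10,000 rows for each of the eight DGA families.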

Test Datasets

  1. test-known/ (8 families)

    • Purpose: Evaluate performance on training families
    • Total samples: 723,847
    • Families:
      • charbot: 11,001 samples
      • deception: 30,001 samples
      • gozi: 50,212 samples
      • manuelita: 20,001 samples
      • matsnu: 116,480 samples
      • nymaim: 217,773 samples
      • rovnix: 120,351 samples
      • suppobox: 158,028 samples
    • Format: Compressed .gz files (one per family)
  2. test-generalization/ (3 families)

    • Purpose: Test generalization to unseen wordlist-based DGAs
    • Total samples: 13,562
    • Families:
      • bigviktor: 2,001 samples
      • ngioweb: 2,001 samples
      • pizd: 9,560 samples
    • Format: Compressed .gz files
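
The per-family test files can be streamed without decompressing them to disk. This is a minimal sketch, assuming each .gz file is UTF-8 text with one domain per line. The internal layout and the file names are not documented here, so adjust the parsing if the files turn out to be CSV. read_domains is a hypothetical helper, and the demo file is a throwaway stand-in for the real data.

```python
import gzip
import os
import tempfile

def read_domains(path, limit=None):
    """Yield non-empty lines from a gzip-compressed text file, up to `limit`."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        count = 0
        for line in handle:
            domain = line.strip()
            if not domain:
                continue  # skip blank lines
            if limit is not None and count >= limit:
                break
            yield domain
            count += 1

# Demo on a temporary file so the sketch runs without the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_family.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8") as handle:
    handle.write("placeholder-one.com\n\nplaceholder-two.net\nplaceholder-three.org\n")

first_two = list(read_domains(demo_path, limit=2))
print(first_two)
```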

📓 Notebooks

Training and evaluation notebooks in notebooks/:

  • ModernBERT_base_DGA_Word.ipynb - Train ModernBERT expert (8 families)
  • ModernBERT_base_DGA_54F.ipynb - Train ModernBERT generalist (54 families)
  • Train_Gemma3_4B_DGA_WordList.ipynb - Fine-tune Gemma with LoRA
  • Train_llama3B_DGA_WordList.ipynb - Fine-tune LLaMA with LoRA
  • Test_Gemma3_4B_DGA_Last.ipynb - Evaluate Gemma model
  • Test__llama3B_DGA.ipynb - Evaluate LLaMA model
  • DomUrlBert.ipynb - Train DomBertUrl model
  • CNN_Patron_WL.ipynb - Train CNN model
  • FANCI.ipynb - Train FANCI Random Forest
  • Labin_wl.ipynb - Train LABin model

🔬 Research Methodology

Two-Phase Evaluation Protocol

  1. Phase 1: Known Families Performance

    • Evaluate on 8 training families
    • 30 test batches × 100 domains per family
    • Measures detection accuracy on familiar variants
  2. Phase 2: Generalization Capability

    • Evaluate on 3 unseen wordlist-based families
    • Tests robustness against novel DGA variants
    • Critical for real-world deployment
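
One way to read the batch protocol above is: draw fixed-size batches, score each batch, and report the mean and standard deviation across batches. The sketch below illustrates that reading only. How batches are actually drawn, whether they include benign domains, and the batch_scores helper itself are all assumptions, and the toy data stands in for real model predictions.

```python
import random
import statistics

def batch_scores(items, score_fn, n_batches=30, batch_size=100, seed=0):
    """Score `n_batches` disjoint random batches of `batch_size` items each."""
    rng = random.Random(seed)
    pool = list(items)
    rng.shuffle(pool)
    needed = n_batches * batch_size
    if len(pool) < needed:
        raise ValueError(f"need {needed} items, got {len(pool)}")
    return [
        score_fn(pool[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]

def detection_rate(batch):
    """Fraction of (true_label, predicted_label) pairs that agree."""
    return sum(1 for y_true, y_pred in batch if y_true == y_pred) / len(batch)

# Toy stand-in for one DGA family: 3,000 DGA domains, 85% of them detected.
toy = [(1, 1 if i % 20 < 17 else 0) for i in range(3000)]
scores = batch_scores(toy, detection_rate)
mean, std = statistics.mean(scores), statistics.stdev(scores)
print(f"{mean:.3f} ± {std:.3f} over {len(scores)} batches")
```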

Evaluation Metrics

  • Precision: Share of domains flagged as DGA that are truly DGA
  • Recall: Share of actual DGA domains that are detected
  • F1-Score: Harmonic mean of precision and recall (primary metric)
  • False Positive Rate (FPR): Share of benign domains misclassified as DGA
  • Inference Time: Latency measured on a Tesla T4 GPU
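
For reference, the first four metrics follow directly from confusion-matrix counts when DGA is treated as the positive class. This is a minimal sketch: the detection_metrics helper and the example counts are illustrative and not taken from the paper.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1 and FPR with DGA as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Hypothetical counts: 1,000 DGA domains and 1,000 benign domains.
m = detection_metrics(tp=800, fp=100, tn=900, fn=200)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```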

DGA Families

Training Families (8 wordlist-based):

  • charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox

Generalization Test (3 wordlist-based):

  • bigviktor, ngioweb, pizd

🛠️ Installation & Requirements

Basic Requirements

pip install torch transformers safetensors

For LLM Models (Gemma/LLaMA)

pip install peft accelerate bitsandbytes

For Traditional Models (FANCI)

pip install scikit-learn joblib

For LABin Model

pip install tensorflow keras

GPU Recommendations

  • Optimal: NVIDIA Tesla T4 or better
  • Minimum: 8GB VRAM for ModernBERT
  • LLMs: 16GB+ VRAM (or use 8-bit quantization)

📊 Reproducibility

All experiments are fully reproducible:

  1. Download datasets from datasets/ folder
  2. Run training notebooks from notebooks/
  3. Load pre-trained models from models/
  4. Verify reported metrics using test sets

Expected Results (±std)

| Model | Known F1 | Unknown F1 |
|---|---|---|
| ModernBERT | 86.7% ± 3.0% | 80.9% ± 4.5% |
| Generalist | 79.2% ± 3.5% | 62.1% ± 5.2% |

📖 Citation

This work is currently under review. Preliminary citation:

@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection: A Systematic Evaluation},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


🙏 Acknowledgments

  • Datasets: DGArchive, 360 Netlab, UMUDga, Tranco
  • Base Models: ModernBERT (Answer.AI), Gemma (Google), LLaMA (Meta)
  • Infrastructure: CONICET Argentina


Last Updated: October 2025
