---
language: en
license: mit
tags:
- dga-detection
- cybersecurity
- domain-generation-algorithm
- wordlist-dga
- bert
- mixture-of-experts
- security
datasets:
- wordlist-dga-train-160k
metrics:
- f1
- precision
- recall
library_name: transformers
pipeline_tag: text-classification
---

# Expert Models for Wordlist-Based DGA Detection

This repository contains the **complete collection of models, datasets, and evaluation notebooks** from the research paper:

**"Expert Selection for Wordlist-Based DGA Detection"** (Currently Under Review)

*Reynier Leyva La O, Carlos A. Catania, and Rodrigo Gonzalez*

---

## 🎯 Overview

This work presents a systematic evaluation of **seven expert model candidates** for detecting wordlist-based Domain Generation Algorithms (DGAs). Through a rigorous two-phase evaluation, **ModernBERT** was identified as the optimal expert model, achieving:

- **86.7% F1-score** on known DGA families
- **80.9% F1-score** on previously unseen families
- **26ms inference time** on an NVIDIA Tesla T4 GPU (~38 domains/second)
- **9.4% relative F1 improvement** over generalist approaches on known families
- **30.2% relative F1 improvement** over generalist approaches on unknown families

---

## 📦 Repository Contents

```
moe-wordlist-dga-models/
│
├── models/
│   ├── modernbert-wordlist-expert/   ⭐ Optimal model (8 wordlist families)
│   ├── modernbert-generalist-54f/    📊 Generalist baseline (54 families)
│   ├── dombert-url/                  🔬 Domain-URL BERT
│   ├── gemma-3-4b-lora/              🤖 Gemma 3 4B LoRA adapters
│   ├── llama-3.2-3b-lora/            🤖 LLaMA 3.2 3B LoRA adapters
│   ├── cnn-wordlist/                 ⚡ Character-level CNN
│   ├── fanci/                        🔧 FANCI Random Forest
│   └── labin/                        🔧 LA Bin07 hybrid
│
├── datasets/
│   ├── train_wl.csv                  📊 Training set (160K domains)
│   └── test_sets/                    🧪 Test sets (in-family + generalization)
│
├── notebooks/                        📓 All training & evaluation notebooks
│
└── scripts/                          🐍 Inference & evaluation scripts
```

---

## 🚀 Quick Start

### Option 1: Use the Optimal Model (Recommended)

The **ModernBERT wordlist expert** is the best-performing model and the easiest to use:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Reynier/moe-wordlist-dga-models"
subfolder = "models/modernbert-wordlist-expert"

tokenizer = AutoTokenizer.from_pretrained(model_name, subfolder=subfolder)
model = AutoModelForSequenceClassification.from_pretrained(model_name, subfolder=subfolder)
model.eval()

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {prediction[0][0]:.4f}")
print(f"DGA: {prediction[0][1]:.4f}")
```
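
Several domains can be scored in one call with the same objects (a small sketch continuing the snippet above; the label order, index 0 = benign and index 1 = DGA, follows the single-domain example):

```python
# Batched scoring with the tokenizer and model loaded above
domains = ["secure-banking-portal.com", "google.com", "random-check-system.net"]
inputs = tokenizer(domains, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=1)
for d, p in zip(domains, probs):
    print(f"{d}: P(DGA) = {p[1]:.4f}")
```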

### Option 2: Download Specific Models

```python
from huggingface_hub import snapshot_download

# Download only ModernBERT expert
snapshot_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    allow_patterns="models/modernbert-wordlist-expert/*",
    local_dir="./models"
)

# Download all models
snapshot_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    local_dir="./complete-repo"
)
```

### Option 3: Clone Entire Repository

```bash
git lfs install
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models
```

---

## 📊 Model Performance Comparison

### Known Families (n=8)

| Model | Precision | Recall | F1-Score | FPR | Inference Time |
|-------|-----------|--------|----------|-----|----------------|
| **ModernBERT** ⭐ | 89.7 ± 4.1% | **86.6 ± 3.1%** | **86.7 ± 3.0%** | 9.0 ± 3.8% | **26ms** |
| LA_Bin07 | 84.6 ± 5.9% | 82.3 ± 3.3% | 81.7 ± 3.8% | 12.0 ± 5.9% | 80ms |
| CNN | 80.9 ± 5.7% | 80.0 ± 4.1% | 78.9 ± 4.0% | 15.3 ± 5.5% | <1ms |
| Gemma 3 4B | **95.4 ± 3.6%** | 66.5 ± 5.7% | 75.2 ± 4.8% | **2.5 ± 2.2%** | 1413ms |
| DomBertUrl | 81.2 ± 6.4% | 69.0 ± 6.9% | 72.4 ± 5.8% | 12.8 ± 5.0% | 13ms |
| FANCI | 70.3 ± 4.8% | 72.7 ± 5.4% | 70.5 ± 4.9% | 27.6 ± 5.5% | 310ms |
| LLaMA 3.2 3B | 92.4 ± 5.5% | 41.9 ± 8.8% | 54.7 ± 8.8% | 2.9 ± 2.2% | 656ms |

### Unknown Families (n=3) - Generalization Test

| Model | Precision | Recall | F1-Score | FPR | Inference Time |
|-------|-----------|--------|----------|-----|----------------|
| **DomBertUrl** | 87.7 ± 4.2% | **82.3 ± 4.5%** | **84.6 ± 3.5%** | 11.5 ± 4.3% | **13ms** |
| **ModernBERT** ⭐ | 89.0 ± 4.4% | 75.5 ± 5.6% | 80.9 ± 4.5% | 9.1 ± 4.1% | 35ms |
| Gemma 3 4B | **95.7 ± 4.4%** | 60.3 ± 5.9% | 70.8 ± 5.0% | **2.2 ± 2.1%** | 1390ms |
| CNN | 76.9 ± 6.9% | 60.2 ± 4.9% | 65.5 ± 5.3% | 15.9 ± 5.4% | <1ms |
| LLaMA 3.2 3B | 60.5 ± 4.4% | 68.8 ± 4.9% | 63.4 ± 4.2% | 39.8 ± 5.8% | 693ms |
| LA_Bin07 | 73.0 ± 9.1% | 45.7 ± 5.3% | 53.7 ± 5.7% | 14.1 ± 5.6% | 80ms |
| FANCI | 51.8 ± 7.6% | 32.0 ± 6.5% | 39.1 ± 6.7% | 27.6 ± 5.5% | 284ms |

> **Note:** Metrics reported as mean ± standard deviation across 30 randomized batches per family. Inference times measured on NVIDIA Tesla T4 GPU.
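
The snippet below is a minimal sketch of one way to reproduce this reporting convention, assuming per-domain labels and predictions are already available; the exact batch construction (batch size and benign/DGA mixing) follows the evaluation notebooks, not this sketch.

```python
import numpy as np
from sklearn.metrics import f1_score

def batched_f1(y_true, y_pred, n_batches=30, seed=42):
    """Mean and std of F1 over randomized batches, as reported in the tables above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    idx = np.random.default_rng(seed).permutation(len(y_true))
    scores = [f1_score(y_true[b], y_pred[b]) for b in np.array_split(idx, n_batches)]
    return float(np.mean(scores)), float(np.std(scores))

# Example: mean_f1, std_f1 = batched_f1(labels, predictions)
```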

---

## 🔬 Specialist vs. Generalist Validation

Direct comparison of **specialist training** (8 wordlist families only) vs. **generalist training** (54 diverse families):

| Scenario | Specialist F1 | Generalist F1 | Relative Improvement |
|----------|---------------|---------------|----------------------|
| Known families | **86.7%** | 79.2% | **+9.4%** |
| Unknown families | **80.9%** | 62.1% | **+30.2%** |

The improvement column is the relative gain in F1, e.g. (86.7 − 79.2) / 79.2 ≈ +9.4% for known families and (80.9 − 62.1) / 62.1 ≈ +30.2% for unknown families. This demonstrates that **domain-specific expert training significantly outperforms broad exposure** to diverse DGA types.

---

## 💻 Using Individual Models

### ModernBERT Generalist (54 families)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/modernbert-generalist-54f"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/modernbert-generalist-54f"
)
```

### Gemma 3 4B with LoRA Adapters

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="auto"
)

# Load LoRA adapters from this repo
model = PeftModel.from_pretrained(
    base_model,
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/gemma-3-4b-lora"
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
```

### LLaMA 3.2 3B with LoRA Adapters

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto"
)

# Load LoRA adapters
model = PeftModel.from_pretrained(
    base_model,
    "Reynier/moe-wordlist-dga-models",
    subfolder="models/llama-3.2-3b-lora"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
```
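
The two snippets above only load the LoRA models; a minimal generation sketch follows. The prompt wording here is illustrative only; the exact prompt and label format used during fine-tuning are defined in the training notebooks (`Train_Gemma3_4B_DGA_WordList.ipynb`, `Train_llama3B_DGA_WordList.ipynb`).

```python
import torch

# Illustrative prompt only; not necessarily the template used for fine-tuning
prompt = 'Is the domain "secure-banking-portal.com" algorithmically generated? Answer DGA or benign.'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Decode only the newly generated tokens
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```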

### FANCI Random Forest

```python
import pickle
from huggingface_hub import hf_hub_download

# Download the pickled model
model_path = hf_hub_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    filename="models/fanci/fanci_model.pkl"
)

# Load the model
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Use the feature extractor included in the repo
# (requires a local clone of this repository on the Python path)
from models.fanci.feature_extractor import extract_features

domain = "secure-banking-portal.com"
features = extract_features(domain)
prediction = model.predict([features])
```

---

## 📊 Dataset Information

### Training Dataset

- **Total samples:** 160,000 (balanced 50/50)
- **DGA samples:** 80,000 from 8 wordlist-based families
- **Benign samples:** 80,000 from Tranco top sites

**DGA Families (Training):**
- charbot (10,000 samples)
- deception (10,000 samples)
- gozi (10,000 samples)
- manuelita (10,000 samples)
- matsnu (10,000 samples)
- nymaim (10,000 samples)
- rovnix (10,000 samples)
- suppobox (10,000 samples)

**Generalization Test Families (Unknown):**
- bigviktor (1,500 samples)
- ngioweb (1,500 samples)
- pizd (1,500 samples)

### Dataset Format

```csv
domain,family,label
secure-banking-portal.com,suppobox,1
google.com,benign,0
random-check-system.net,matsnu,1
```
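
To inspect the training set directly, the CSV can be fetched and loaded with pandas (a small sketch; the path `datasets/train_wl.csv` and the column names follow the repository layout and format shown above):

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Fetch the training CSV from this repository
csv_path = hf_hub_download(
    repo_id="Reynier/moe-wordlist-dga-models",
    filename="datasets/train_wl.csv"
)

df = pd.read_csv(csv_path)
print(df["label"].value_counts())   # expected: ~80,000 benign (0) and ~80,000 DGA (1)
print(df["family"].value_counts())  # 8 wordlist families plus benign
```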

---

## 📓 Reproducing Paper Results

All training and evaluation notebooks are included in the `notebooks/` directory:

1. **Clone this repository:**
   ```bash
   git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
   cd moe-wordlist-dga-models/notebooks
   ```

2. **Install dependencies:**
   ```bash
   pip install torch transformers scikit-learn pandas numpy matplotlib seaborn jupyter
   ```

3. **Run notebooks:**
   ```bash
   jupyter notebook ModernBERT_base_DGA_Word.ipynb
   ```

### Available Notebooks

- `ModernBERT_base_DGA_Word.ipynb` - Optimal expert training
- `ModernBERT_base_DGA_54F.ipynb` - Generalist baseline
- `Train_Gemma3_4B_DGA_WordList.ipynb` - Gemma LoRA training
- `Train_llama3B_DGA_WordList.ipynb` - LLaMA LoRA training
- `DomUrlBert.ipynb` - DomBertUrl training
- `CNN_Patron_WL.ipynb` - CNN training
- `FANCI.ipynb` - Random Forest baseline
- `Labin_wl.ipynb` - LA Bin07 hybrid

---

## 🔍 Inference Scripts

Ready-to-use Python scripts are available in `scripts/`:

```bash
# Classify single domain with optimal model
python scripts/classify_domain.py "secure-banking-portal.com"

# Batch classification from CSV
python scripts/batch_classify.py --input domains.csv --output results.csv

# Compare all models
python scripts/compare_all_models.py --domain "test-domain.com"
```

---

## 🎓 Citation

```bibtex
@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}
```

---

## 📄 License

MIT License - see the LICENSE file for details.

---

## 🤝 Contributing & Contact

For questions regarding model usage or experimental reproducibility:

- **Email:** [email protected]
- **GitHub:** https://github.com/reypapin/MoE-word-list-dga-detection
- **Issues:** Open an issue on GitHub for technical questions

---

## 🙏 Acknowledgments

- **Hardware:** NVIDIA Tesla T4 GPUs provided by Google Colab
- **Datasets:** DGArchive, 360 Netlab, UMUDga repositories, Tranco list
- **Base Models:** Answer.AI (ModernBERT), Google (Gemma), Meta (LLaMA)
- **Funding:** National Scientific and Technical Research Council (CONICET), Argentina