metadata

license: apache-2.0
language:
  - en
tags:
  - canine
  - character-level
  - mlm
  - domain-names
  - pretrained
datasets:
  - humbleworth/registered-domains
base_model: google/canine-c
model-index:
  - name: domain-mlm-epoch-2
    results:
      - task:
          type: fill-mask
          name: Masked Language Modeling
        dataset:
          name: humbleworth/registered-domains
          type: humbleworth/registered-domains
          split: validation
        metrics:
          - type: perplexity
            value: 3.36
            name: Validation Perplexity

Domain MLM - CANINE Character-Level Model for Domain Names (Epoch 2)

This model is a CANINE-based character-level language model that has been further pre-trained on domain names using masked language modeling (MLM). It's designed to understand and predict patterns in domain names at the character level.

Model Description

This is a checkpoint from epoch 2 of training CANINE-c on domain name data. The model continues pretraining from Google's CANINE-c base model, adapting it specifically to domain name patterns through masked character prediction.

Key Features

Character-level processing: Works directly with Unicode code points, no tokenization required
Domain-specific: Pre-trained on 255M registered domain names
Masked Language Modeling: Trained to predict masked characters in domain names (25% masking probability)
Efficient: 132M parameters, suitable for downstream fine-tuning
Strong Performance: Achieved 3.36 validation perplexity

Architecture

Base Model: google/canine-c (CANINE-C with 132M parameters)
Model Type: CANINE (Character Architecture with No tokenization In Neural Encoders)
Hidden Size: 768
Layers: 12
Attention Heads: 12
Max Position Embeddings: 16,384 (far more than domains need; training used a max_length of 64)
Vocabulary: Direct Unicode code points (no vocabulary file needed)
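
Because the vocabulary is simply the set of Unicode code points, the input IDs for a domain can be sketched without any vocabulary file. The snippet below is a minimal illustration rather than the tokenizer's actual implementation. It assumes the special code points used by the Hugging Face CanineTokenizer ([CLS] = U+E000, [SEP] = U+E001), and encode_domain is a hypothetical helper name.

```python
# Special code points from the Unicode Private Use Area, as used by the
# Hugging Face CanineTokenizer (an assumption here, not stated in this card).
CLS = 0xE000
SEP = 0xE001


def encode_domain(domain: str) -> list[int]:
    """Map a domain to [CLS] + one code point per character + [SEP]."""
    return [CLS] + [ord(ch) for ch in domain] + [SEP]


ids = encode_domain("example.com")
print(len(ids))   # 13: 11 characters plus 2 special code points
print(ids[1:4])   # [101, 120, 97] for "e", "x", "a"
```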

Training Details

Training Data: humbleworth/registered-domains dataset (255M domains)
Training Objective: Masked Language Modeling (MLM) with 25% masking probability
Masking Strategy: Mix of contiguous spans (80%) and random characters (20%)
Optimizer: AdamW with learning rate 3e-5, weight decay 0.01
Batch Size: 512 per device with gradient accumulation steps of 3 (effective batch size: 1,536)
Hardware: NVIDIA A100 40GB
Mixed Precision: BF16 automatic mixed precision
Training Framework: PyTorch with custom training loop
Warmup Steps: 2,000
Total Steps: ~830,000 scheduled (2 epochs completed at step 332,200)
Training Time: ~36 hours for 2 epochs
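
The masking strategy listed above can be sketched roughly as follows. This is not the training code: the card does not state span lengths or how the 80/20 mix is applied, so the sketch assumes each masking decision is a contiguous span with probability 0.8 and a single character otherwise. The names choose_mask_positions, apply_mask, and max_span are hypothetical, and a visible "_" placeholder stands in for the real mask code point.

```python
import random


def choose_mask_positions(length, mask_prob=0.25, span_fraction=0.8,
                          max_span=5, rng=None):
    """Pick about mask_prob * length positions to mask.

    Assumption: each masking decision is a contiguous span with probability
    span_fraction and a single random character otherwise. max_span is a
    hypothetical cap; the card does not state span lengths.
    """
    if length <= 0:
        return []
    rng = rng or random.Random()
    budget = max(1, round(length * mask_prob))
    masked = set()
    while len(masked) < budget:
        remaining = budget - len(masked)
        if rng.random() < span_fraction:
            # Contiguous span, clipped so the budget is never exceeded.
            span_len = min(rng.randint(2, max_span), remaining, length)
            start = rng.randrange(length - span_len + 1)
            masked.update(range(start, start + span_len))
        else:
            # Single random character.
            masked.add(rng.randrange(length))
    return sorted(masked)


def apply_mask(domain, positions, placeholder="_"):
    """Replace the chosen positions with a visible placeholder character."""
    chars = list(domain)
    for i in positions:
        chars[i] = placeholder
    return "".join(chars)


rng = random.Random(0)
domain = "example-shop.com"  # 16 characters, so 4 positions are masked
positions = choose_mask_positions(len(domain), rng=rng)
print(apply_mask(domain, positions))
```

In real training the placeholder would be the mask code point expected by the MLM head, and the original characters at the masked positions would serve as labels.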

Performance Metrics

Epoch 2 Results:

Training Loss: 1.29
Training Perplexity: 3.62
Validation Loss: 1.21
Validation Perplexity: 3.36
Best Training Perplexity: 3.49 (achieved during epoch 2)
Processing Speed: 4,037 samples/second
GPU Memory Usage: 2.85 GB (highly optimized)
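
The batch size, step count, and throughput reported in this card can be cross-checked with simple arithmetic. The sketch below uses only figures from the card and assumes a single GPU (num_devices = 1), since the card lists one A100.

```python
# Figures reported in this card.
per_device_batch = 512
grad_accum_steps = 3
num_devices = 1                  # assumption: the card lists a single A100
steps_after_two_epochs = 332_200
samples_per_second = 4_037

effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = steps_after_two_epochs / 2
samples_per_epoch = steps_per_epoch * effective_batch
hours_for_two_epochs = 2 * samples_per_epoch / samples_per_second / 3600

print(effective_batch)                # 1536
print(f"{samples_per_epoch:,.0f}")    # 255,129,600
print(f"{hours_for_two_epochs:.1f}")  # 35.1
```

Under these assumptions, one epoch covers about 255M samples, in line with the dataset size, and the pure compute time of about 35 hours is consistent with the reported ~36 hours for 2 epochs.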

Perplexity fell from an initial 10.08 to 3.36 on the validation set after epoch 2. A validation perplexity of 3.36 means that, on average, the model's uncertainty about a masked character is equivalent to a uniform choice among roughly 3-4 candidates.

Intended Uses & Limitations

Intended Uses

Domain name completion and suggestion
Understanding domain name patterns
Feature extraction for domain-related tasks
Fine-tuning for domain classification tasks
Domain name generation (with additional fine-tuning)
Character-level anomaly detection in domains

Limitations

Trained only on ASCII domain names (the training set contains no internationalized domains)
Limited to domains up to 64 characters (training max_length)
Not suitable for general text understanding tasks
Performance on internationalized domain names (IDN) is likely limited, since none appear in the training data
The model has learned strong biases toward common TLDs (.com, .net, .org)
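
A simple pre-check can flag inputs that fall outside what the model saw during training. The sketch below is one possible filter based on the limits stated in this card: lowercase letters, digits, hyphens, and dots only, with at most 64 characters. The function name is hypothetical, and whether the 64-character limit includes special code points is not stated, so the check treats it as a plain character count.

```python
import re

# Characters the card reports in the training data: a-z, 0-9, hyphen, dot.
_ALLOWED = re.compile(r"[a-z0-9.-]+")
MAX_LENGTH = 64  # training max_length reported in this card


def in_training_distribution(domain: str, max_length: int = MAX_LENGTH) -> bool:
    """Return True if the domain uses only the character set and length seen in training."""
    return len(domain) <= max_length and _ALLOWED.fullmatch(domain) is not None


print(in_training_distribution("example.com"))      # True
print(in_training_distribution("münchen.de"))       # False: non-ASCII character
print(in_training_distribution("Example.COM"))      # False: uppercase, lowercase it first
print(in_training_distribution("a" * 70 + ".com"))  # False: longer than 64 characters
```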

How to Use

Basic Usage

import torch
from transformers import CanineTokenizer, CanineModel, CanineConfig

# Load tokenizer
tokenizer = CanineTokenizer.from_pretrained('humbleworth/domain-mlm')

# Load the config and the CANINE encoder (the config is reused when building the MLM model)
config = CanineConfig.from_pretrained('humbleworth/domain-mlm')
model = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Encode a domain
domain = "example.com"
inputs = tokenizer(domain, return_tensors="pt")

# Get character-level embeddings, shape (batch, sequence_length, 768)
with torch.no_grad():
    outputs = model(**inputs)
    char_embeddings = outputs.last_hidden_state

For Masked Language Modeling

To use the model for masked character prediction, you'll need to load the custom MLM head:

# Note: You'll need the custom CanineForMaskedLM class from the training code
# The MLM head weights are stored in training_state.bin

import sys
sys.path.append('path/to/training/code')
from train_mlm import CanineForMaskedLM

# Load model with MLM head
model = CanineForMaskedLM(config)
model.canine = CanineModel.from_pretrained('humbleworth/domain-mlm')

# Load MLM head weights
state_dict = torch.load('training_state.bin', map_location='cpu')
model.mlm_head.load_state_dict(state_dict['mlm_head_state_dict'])

# Predict masked characters
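# [MASK] is replaced with a private-use code point before encoding. This card gives U+E000;
# in Hugging Face's CanineTokenizer, U+E000 is [CLS] and [MASK] is U+E003, so check train_mlm.py.
masked_domain = "goo[MASK]le.com"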
# ... prediction code ...

Training Data

The model was trained on the humbleworth/registered-domains dataset, which contains:

Dataset Statistics

Total Size: 255,097,510 unique registered domain names
File Size: 4.1 GB
Source: Domains Project
Character Set: 100% ASCII (no internationalized domains)
Average Length: 15.9 characters (range: 4-77 characters)
Training/Validation Split: 99.9% / 0.1%

TLD Distribution

Total Unique TLDs: 1,274
Top TLDs:
- .com: 139,092,425 (54.5%)
- .net: 12,240,626 (4.8%)
- .de: 11,349,715 (4.4%)
- .org: 10,107,145 (4.0%)
- .nl: 3,739,084 (1.5%)

Domain Characteristics

Domains with numbers: 22,570,972 (8.8%)
Domains with hyphens: 29,207,936 (11.4%)
Character patterns: Lowercase letters, numbers, hyphens, and dots only
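
Statistics like the TLD and character counts above could be computed along the following lines. This is a minimal sketch, not the script used to produce the figures in this card. It assumes a non-empty list of domains and treats the label after the last dot as the TLD; domain_stats is a hypothetical helper.

```python
from collections import Counter


def domain_stats(domains):
    """Count TLDs and the share of names containing digits or hyphens."""
    tlds = Counter()
    with_digit = with_hyphen = 0
    for name in domains:
        # Assumption: the TLD is the label after the last dot.
        tlds[name.rsplit(".", 1)[-1]] += 1
        with_digit += any(ch.isdigit() for ch in name)
        with_hyphen += "-" in name
    total = len(domains)
    return {
        "tlds": tlds,
        "digit_share": with_digit / total,
        "hyphen_share": with_hyphen / total,
    }


sample = ["example.com", "my-shop.de", "web3.com", "test.org"]
stats = domain_stats(sample)
print(stats["tlds"].most_common(1))  # [('com', 2)]
print(stats["digit_share"])          # 0.25
print(stats["hyphen_share"])         # 0.25

# The reported .com share follows from the counts in this card.
print(f"{139_092_425 / 255_097_510:.1%}")  # 54.5%
```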

With 255 million names across 1,274 TLDs, the dataset covers a broad range of real-world domain patterns, though more than half of the names are under .com. This makes it well suited to training character-level models on domain name structures and conventions.

Evaluation

Perplexity Analysis

The model achieved a validation perplexity of 3.36, which means:

On average, the model's uncertainty at each masked position is equivalent to a uniform choice among about 3.36 characters
A uniform guess over the 38 characters that occur in the training data (a-z, 0-9, hyphen, dot) would give a perplexity of 38
The low perplexity indicates strong pattern learning, including:
- TLD patterns (high certainty after dots)
- Common domain prefixes and suffixes
- Valid character sequences in domain names

Training Progression

Initial: Loss=2.31, Perplexity=10.08
Epoch 1: ~4.5-5.0 perplexity (estimated)
Epoch 2: Loss=1.21, Perplexity=3.36
Best achieved: Perplexity=3.49 (training), 3.36 (validation)

The model appears to be approaching an asymptotic performance around 3.2-3.5 perplexity, suggesting it has learned most learnable patterns in the domain dataset.
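
The loss and perplexity figures above are linked by the usual relation: perplexity is the exponential of the mean cross-entropy loss. The sketch below assumes the loss is measured in nats (natural logarithm), which is the PyTorch default. The card does not state this explicitly.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy_nats)


reported = [("initial", 2.31), ("epoch 2 train", 1.29), ("epoch 2 validation", 1.21)]
for label, loss in reported:
    print(f"{label}: loss={loss:.2f} -> perplexity={perplexity(loss):.2f}")
# initial: loss=2.31 -> perplexity=10.07
# epoch 2 train: loss=1.29 -> perplexity=3.63
# epoch 2 validation: loss=1.21 -> perplexity=3.35
```

The small differences from the reported values (10.08, 3.62, 3.36) are what one would expect if the card's perplexities were computed from unrounded losses.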

Technical Specifications

Model Architecture

12 transformer layers
768 hidden dimensions
12 attention heads
GELU activation
Layer normalization
Dropout: 0.1

Infrastructure

Trained on NVIDIA A100 40GB GPU
PyTorch 2.0+
Mixed precision training (BF16)
Custom training loop implementation
Gradient clipping: 1.0
Training tracked with Weights & Biases

Citation

If you use this model, please cite:

@misc{domain-mlm-2025,
  title={Domain MLM: Character-Level Language Modeling for Domain Names},
  author={humbleworth},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/humbleworth/domain-mlm}}
}

License

This model is released under the Apache 2.0 license.

Acknowledgments

Based on Google's CANINE-c model
Trained using the humbleworth/registered-domains dataset
Optimized training code for NVIDIA A100 GPUs
Training infrastructure provided by Lambda Labs