---
library_name: transformers
tags:
  - cyber
  - cybersecurity
  - code
  - T5
license: mit
base_model:
  - google/flan-t5-large
pipeline_tag: text2text-generation
datasets:
  - Anvilogic/T5-Typosquat-Training-Dataset
---

# Typosquat T5 detector

## Model Details

### Model Description

This model is an encoder-decoder fine-tuned to detect typosquatting of domain names, built on the flan-t5-large transformer. It classifies whether a domain name is a typographical variant (typosquat) of another domain.

- Developed by: Anvilogic
- Model type: Encoder-Decoder
- Maximum Sequence Length: 512 tokens
- Language(s) (NLP): Multilingual
- License: MIT
- Fine-tuned from model: google/flan-t5-large

## Usage

### Direct Usage (Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.

To get started, load and test the model with the following code:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Example input: a candidate domain and the legitimate domain it may imitate
typosquat_candidate = "goog1e.com"
legitimate_domain = "google.com"

# Build the prompt and tokenize it
input_text = f"Is the first domain a typosquat of the second: {typosquat_candidate} {legitimate_domain}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate and decode the model's verdict
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Output:

```
false
```
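
For batch or repeated checks, the model can also be wrapped in a `text2text-generation` pipeline. The snippet below is a minimal sketch: the `is_typosquat` helper and the example domains are illustrative additions, not part of the model's official API.

```python
from transformers import pipeline

# Wrap the model in a text2text-generation pipeline for convenience
detector = pipeline("text2text-generation", model="Anvilogic/Flan-T5-typosquat-detect")

def is_typosquat(candidate: str, legitimate: str) -> bool:
    # Illustrative helper: True if the model labels `candidate`
    # as a typosquat of `legitimate`.
    prompt = f"Is the first domain a typosquat of the second: {candidate} {legitimate}"
    result = detector(prompt)[0]["generated_text"]
    return result.strip().lower() == "true"

print(is_typosquat("paypa1.com", "paypal.com"))
```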

### Downstream Usage

This model can be paired with an embedding model to enhance typosquatting detection. First, the embedding model retrieves similar domains from a database of legitimate domains. Then, this encoder-decoder labels the resulting pairs, confirming whether a domain is a typosquat and identifying its original source.

For the embedding stage, consider using Anvilogic/Embedder-typosquat-detect.
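
A minimal sketch of this two-stage flow is shown below. It assumes the embedder loads with the sentence-transformers library, and the in-memory list of legitimate domains stands in for a real database.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stage 1: retrieve the closest legitimate domain with the embedding model
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
legitimate_domains = ["google.com", "paypal.com", "microsoft.com"]  # illustrative database
candidate = "goog1e.com"

corpus_embeddings = embedder.encode(legitimate_domains, convert_to_tensor=True)
query_embedding = embedder.encode(candidate, convert_to_tensor=True)
best_match = legitimate_domains[int(util.cos_sim(query_embedding, corpus_embeddings).argmax())]

# Stage 2: have the T5 detector confirm the candidate/match pair
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

prompt = f"Is the first domain a typosquat of the second: {candidate} {best_match}"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
verdict = tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)
print(best_match, verdict)
```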

## Bias, Risks, and Limitations

Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

## Training Details

### Framework Versions

- Python: 3.10.14
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned using Anvilogic/T5-Typosquat-Training-Dataset, which contains pairs of domain names and the expected response.

### Training Procedure

The model was optimized with a token-level cross-entropy loss over the decoder logits (PyTorch's CrossEntropyLoss), the standard objective for sequence-to-sequence fine-tuning.
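
Concretely, this is the objective Transformers computes automatically when `labels` are passed to the model. The snippet below illustrates that mechanism; it is not the original training script, and the target label is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

inputs = tokenizer(
    "Is the first domain a typosquat of the second: goog1e.com google.com",
    return_tensors="pt",
)
labels = tokenizer("true", return_tensors="pt").input_ids  # illustrative target label

# When labels are supplied, the forward pass returns the token-level
# cross-entropy loss between the decoder logits and the target tokens.
outputs = model(**inputs, labels=labels)
print(outputs.loss)
```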

### Training Hyperparameters

- Model Architecture: Encoder-Decoder fine-tuned from google/flan-t5-large
- Batch Size: 8
- Epochs: 5
- Learning Rate: 5e-5
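
A minimal fine-tuning sketch using these hyperparameters follows. The `input`/`label` column names and the `train` split are assumptions about the dataset's schema; adjust them to match the actual columns.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

dataset = load_dataset("Anvilogic/T5-Typosquat-Training-Dataset")

def preprocess(batch):
    # Column names are assumptions; adjust to the dataset's actual schema
    model_inputs = tokenizer(batch["input"], truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(batch["label"], truncation=True).input_ids
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-typosquat",
    per_device_train_batch_size=8,  # batch size from the card
    num_train_epochs=5,             # epochs from the card
    learning_rate=5e-5,             # learning rate from the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```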

## Evaluation

### Training loss

| Epoch | Training loss | Validation loss |
|-------|---------------|-----------------|
| 1     | 0.0807        | 0.016496        |
| 2     | 0.0270        | 0.018645        |
| 3     | 0.0034        | 0.016577        |
| 4     | 0.0002        | 0.012842        |
| 5     | 0.0407        | 0.014530        |

We kept only the fourth-epoch checkpoint, as it exhibits the best validation loss.