---
library_name: transformers
tags:
  - cyber
  - cybersecurity
  - code
  - T5
license: mit
base_model:
  - google/flan-t5-large
pipeline_tag: text2text-generation
datasets:
  - Anvilogic/T5-Typosquat-Training-Dataset
---

# Typosquat T5 detector

## Model Details

### Model Description

This model is an encoder-decoder fine-tuned to detect typosquatting of domain names, built on the flan-t5-large transformer. It classifies whether a domain name is a typographical variant (typosquat) of another domain.

- Developed by: Anvilogic
- Model type: Encoder-Decoder
- Maximum Sequence Length: 512 tokens
- Language(s) (NLP): Multilingual
- License: MIT
- Fine-tuned from model: google/flan-t5-large

## Usage

### Direct Usage (Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.

To get started, load and test the model with the following code:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Example input: a candidate domain and the legitimate domain it may imitate
typosquat_candidate = "goog1e.com"
legitimate_domain = "google.com"

# Build the prompt and tokenize it
input_text = f"Is the first domain a typosquat of the second: {typosquat_candidate} {legitimate_domain}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate and decode the model's verdict
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Output:

```
false
```
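
For batch or repeated checks, the model can also be wrapped in a `text2text-generation` pipeline. The snippet below is a minimal sketch: the `is_typosquat` helper and the example domains are illustrative additions, not part of the model's official API.

```python
from transformers import pipeline

# Wrap the model in a text2text-generation pipeline for convenience
detector = pipeline("text2text-generation", model="Anvilogic/Flan-T5-typosquat-detect")

def is_typosquat(candidate: str, legitimate: str) -> bool:
    # Illustrative helper: True if the model labels `candidate`
    # as a typosquat of `legitimate`.
    prompt = f"Is the first domain a typosquat of the second: {candidate} {legitimate}"
    result = detector(prompt)[0]["generated_text"]
    return result.strip().lower() == "true"

print(is_typosquat("paypa1.com", "paypal.com"))
```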

### Downstream Usage

This model can be paired with an embedding model to enhance typosquatting detection. First, the embedding model retrieves similar domains from a database of legitimate domains. Then, this encoder-decoder labels the resulting pairs, confirming whether a domain is a typosquat and identifying its original source.

For the embedding stage, consider using Anvilogic/Embedder-typosquat-detect.
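
A minimal sketch of this two-stage flow is shown below. It assumes the embedder loads with the sentence-transformers library, and the in-memory list of legitimate domains stands in for a real database.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stage 1: retrieve the closest legitimate domain with the embedding model
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
legitimate_domains = ["google.com", "paypal.com", "microsoft.com"]  # illustrative database
candidate = "goog1e.com"

corpus_embeddings = embedder.encode(legitimate_domains, convert_to_tensor=True)
query_embedding = embedder.encode(candidate, convert_to_tensor=True)
best_match = legitimate_domains[int(util.cos_sim(query_embedding, corpus_embeddings).argmax())]

# Stage 2: have the T5 detector confirm the candidate/match pair
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

prompt = f"Is the first domain a typosquat of the second: {candidate} {best_match}"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
verdict = tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)
print(best_match, verdict)
```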

## Bias, Risks, and Limitations

Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

## Training Details

### Framework Versions

- Python: 3.10.14
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned using Anvilogic/T5-Typosquat-Training-Dataset, which contains pairs of domain names and the expected response.

### Training Procedure

The model was optimized with a token-level cross-entropy loss over the decoder logits (PyTorch's CrossEntropyLoss), the standard objective for sequence-to-sequence fine-tuning.
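
Concretely, this is the objective Transformers computes automatically when `labels` are passed to the model. The snippet below illustrates that mechanism; it is not the original training script, and the target label is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

inputs = tokenizer(
    "Is the first domain a typosquat of the second: goog1e.com google.com",
    return_tensors="pt",
)
labels = tokenizer("true", return_tensors="pt").input_ids  # illustrative target label

# When labels are supplied, the forward pass returns the token-level
# cross-entropy loss between the decoder logits and the target tokens.
outputs = model(**inputs, labels=labels)
print(outputs.loss)
```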

### Training Hyperparameters

- Model Architecture: Encoder-Decoder fine-tuned from google/flan-t5-large
- Batch Size: 8
- Epochs: 5
- Learning Rate: 5e-5
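
A minimal fine-tuning sketch using these hyperparameters follows. The `input`/`label` column names and the `train` split are assumptions about the dataset's schema; adjust them to match the actual columns.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

dataset = load_dataset("Anvilogic/T5-Typosquat-Training-Dataset")

def preprocess(batch):
    # Column names are assumptions; adjust to the dataset's actual schema
    model_inputs = tokenizer(batch["input"], truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(batch["label"], truncation=True).input_ids
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-typosquat",
    per_device_train_batch_size=8,  # batch size from the card
    num_train_epochs=5,             # epochs from the card
    learning_rate=5e-5,             # learning rate from the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```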

## Evaluation

### Training loss

| Epoch | Training loss | Validation loss |
|-------|---------------|-----------------|
| 1     | 0.0807        | 0.016496        |
| 2     | 0.0270        | 0.018645        |
| 3     | 0.0034        | 0.016577        |
| 4     | 0.0002        | 0.012842        |
| 5     | 0.0407        | 0.014530        |

We kept only the fourth-epoch checkpoint, as it exhibits the best validation loss.