---
library_name: transformers
tags:
- cyber
- cybersecurity
- code
- T5
license: mit
base_model:
- google/flan-t5-large
pipeline_tag: text2text-generation
datasets:
- Anvilogic/T5-Typosquat-Training-Dataset
---

# Typosquat T5 detector

## Model Details

### Model Description

This model is an encoder-decoder fine-tuned to detect typosquatting of domain names, leveraging the [flan-t5-large](https://huggingface.co/google/flan-t5-large) transformer model. It can be used to classify whether a domain name is a typographical variant (typosquat) of another domain.

- **Developed by:** Anvilogic
- **Model type:** Encoder-Decoder
- **Maximum Sequence Length:** 512 tokens
- **Language(s) (NLP):** Multilingual
- **License:** MIT
- **Finetuned from model:** [flan-t5-large](https://huggingface.co/google/flan-t5-large)

## Usage

### Direct Usage (Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.

To get started, load and test the model with the following code:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Example input: the candidate domain followed by the legitimate domain
typosquat_candidate = "goog1e.com"
legitimate_domain = "google.com"
input_text = f"Is the first domain a typosquat of the second: {typosquat_candidate} {legitimate_domain}"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: false
```

### Downstream Usage

This model can be combined with an embedding model to enhance typosquatting detection. First, an embedding model retrieves similar domains from a database of legitimate domains; this encoder-decoder then labels the resulting pairs, confirming whether a domain is a typosquat and identifying the legitimate domain it imitates. For the embedding step, consider [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect). A sketch of this two-stage pipeline is given under Example Sketches at the end of this card.

## Bias, Risks, and Limitations

Users are advised to treat this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

## Training Details

### Framework Versions

- Python: 3.10.14
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned on [Anvilogic/T5-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/T5-Typosquat-Training-Dataset), which contains pairs of domain names together with the expected response.

### Training Procedure

The model was optimized with the standard sequence-to-sequence cross-entropy loss computed over the output token logits (PyTorch's `CrossEntropyLoss`). A sketch of the fine-tuning setup is given under Example Sketches at the end of this card.

#### Training Hyperparameters

- **Model Architecture:** Encoder-Decoder fine-tuned from [flan-t5-large](https://huggingface.co/google/flan-t5-large)
- **Batch Size:** 8
- **Epochs:** 5
- **Learning Rate:** 5e-5

## Evaluation

**Loss per epoch**

| Epoch | Training loss | Validation loss |
|-------|---------------|-----------------|
| 1     | 0.0807        | 0.016496        |
| 2     | 0.0270        | 0.018645        |
| 3     | 0.0034        | 0.016577        |
| 4     | 0.0002        | 0.012842        |
| 5     | 0.0407        | 0.014530        |

Only the epoch-4 checkpoint was kept, as it achieves the lowest validation loss.
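## Example Sketches

### Using the `pipeline` helper

The Direct Usage example loads the tokenizer and model explicitly; the same query can also be issued through the higher-level `pipeline` helper from `transformers`:

```python
from transformers import pipeline

# The pipeline wraps tokenization, generation, and decoding in one call.
detector = pipeline("text2text-generation", model="Anvilogic/Flan-T5-typosquat-detect")
prompt = "Is the first domain a typosquat of the second: goog1e.com google.com"
print(detector(prompt)[0]["generated_text"])
```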
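### Two-stage detection pipeline

The sketch below illustrates the retrieval-then-classification setup described under Downstream Usage. It assumes the embedder can be loaded through the `sentence-transformers` API; the allowlist, the 0.9 similarity threshold, and the helper name `is_typosquat` are illustrative choices, not part of the released models.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stage 1: embed a database of legitimate domains for retrieval.
# Assumption: the embedder loads via the sentence-transformers API.
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
legitimate_domains = ["google.com", "amazon.com", "microsoft.com"]  # your allowlist
legit_embeddings = embedder.encode(legitimate_domains, convert_to_tensor=True)

# Stage 2: confirm candidate pairs with the T5 detector.
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

def is_typosquat(candidate: str, threshold: float = 0.9):
    """Return (verdict, matched_domain) for a candidate domain (illustrative helper)."""
    query = embedder.encode(candidate, convert_to_tensor=True)
    scores = util.cos_sim(query, legit_embeddings)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return False, None  # no legitimate domain is close enough to be an imitation target
    target = legitimate_domains[best]
    prompt = f"Is the first domain a typosquat of the second: {candidate} {target}"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer = tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True)
    # Assumes the model answers "true"/"false", as in the Direct Usage example.
    return answer.strip().lower() == "true", target

print(is_typosquat("goog1e.com"))
```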
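### Fine-tuning setup

For reference, a fine-tuning run matching the hyperparameters listed above could look like the following minimal sketch. The dataset column names (`input_text`, `target_text`), the output directory, and the presence of a validation split are assumptions; check the dataset card for the actual schema.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
dataset = load_dataset("Anvilogic/T5-Typosquat-Training-Dataset")

def preprocess(batch):
    # Column names "input_text" / "target_text" are assumed, not confirmed by the card.
    inputs = tokenizer(batch["input_text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target_text"], max_length=8, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-typosquat",  # illustrative path
    per_device_train_batch_size=8,   # batch size from the card
    num_train_epochs=5,              # epochs from the card
    learning_rate=5e-5,              # learning rate from the card
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),  # assumes a validation split exists
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```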