---
library_name: transformers
tags:
- cyber
- cybersecurity
- code
- T5
license: mit
base_model:
- google/flan-t5-large
pipeline_tag: text2text-generation
datasets:
- Anvilogic/T5-Typosquat-Training-Dataset
---

# Typosquat T5 detector

## Model Details

### Model Description

This model is an encoder-decoder fine-tuned to detect typosquatting of domain names, leveraging the [flan-t5-large](https://huggingface.co/google/flan-t5-large) transformer model. It can be used to classify whether a domain name is a typographical variant (typosquat) of another domain.

- **Developed by:** Anvilogic
- **Model type:** Encoder-Decoder
- **Maximum Sequence Length:** 512 tokens
- **Language(s) (NLP):** Multilingual
- **License:** MIT
- **Finetuned from model:** [flan-t5-large](https://huggingface.co/google/flan-t5-large)

## Usage

### Direct Usage (Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.

To get started, load and test the model with the following code:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Example input: the candidate domain followed by the legitimate domain
typosquat_candidate = "goog1e.com"
legitimate_domain = "google.com"
input_text = f"Is the first domain a typosquat of the second: {typosquat_candidate} {legitimate_domain}"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: false
```

### Downstream Usage

This model can be combined with an embedding model to enhance typosquatting detection. First, an embedding model retrieves similar domains from a database of legitimate domains; this encoder-decoder then labels the resulting pairs, confirming whether a domain is a typosquat and identifying the legitimate domain it imitates. For the embedding step, consider [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect). A sketch of this two-stage pipeline is given under Example Sketches at the end of this card.

## Bias, Risks, and Limitations

Users are advised to treat this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

## Training Details

### Framework Versions

- Python: 3.10.14
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned on [Anvilogic/T5-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/T5-Typosquat-Training-Dataset), which contains pairs of domain names together with the expected response.

### Training Procedure

The model was optimized with the standard sequence-to-sequence cross-entropy loss computed over the output token logits (PyTorch's `CrossEntropyLoss`). A sketch of the fine-tuning setup is given under Example Sketches at the end of this card.

#### Training Hyperparameters

- **Model Architecture:** Encoder-Decoder fine-tuned from [flan-t5-large](https://huggingface.co/google/flan-t5-large)
- **Batch Size:** 8
- **Epochs:** 5
- **Learning Rate:** 5e-5

## Evaluation

**Loss per epoch**

| Epoch | Training loss | Validation loss |
|-------|---------------|-----------------|
| 1     | 0.0807        | 0.016496        |
| 2     | 0.0270        | 0.018645        |
| 3     | 0.0034        | 0.016577        |
| 4     | 0.0002        | 0.012842        |
| 5     | 0.0407        | 0.014530        |

Only the epoch-4 checkpoint was kept, as it achieves the lowest validation loss.
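## Example Sketches

### Using the `pipeline` helper

The Direct Usage example loads the tokenizer and model explicitly; the same query can also be issued through the higher-level `pipeline` helper from `transformers`:

```python
from transformers import pipeline

# The pipeline wraps tokenization, generation, and decoding in one call.
detector = pipeline("text2text-generation", model="Anvilogic/Flan-T5-typosquat-detect")
prompt = "Is the first domain a typosquat of the second: goog1e.com google.com"
print(detector(prompt)[0]["generated_text"])
```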
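### Two-stage detection pipeline

The sketch below illustrates the retrieval-then-classification setup described under Downstream Usage. It assumes the embedder can be loaded through the `sentence-transformers` API; the allowlist, the 0.9 similarity threshold, and the helper name `is_typosquat` are illustrative choices, not part of the released models.

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stage 1: embed a database of legitimate domains for retrieval.
# Assumption: the embedder loads via the sentence-transformers API.
embedder = SentenceTransformer("Anvilogic/Embedder-typosquat-detect")
legitimate_domains = ["google.com", "amazon.com", "microsoft.com"]  # your allowlist
legit_embeddings = embedder.encode(legitimate_domains, convert_to_tensor=True)

# Stage 2: confirm candidate pairs with the T5 detector.
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

def is_typosquat(candidate: str, threshold: float = 0.9):
    """Return (verdict, matched_domain) for a candidate domain (illustrative helper)."""
    query = embedder.encode(candidate, convert_to_tensor=True)
    scores = util.cos_sim(query, legit_embeddings)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return False, None  # no legitimate domain is close enough to be an imitation target
    target = legitimate_domains[best]
    prompt = f"Is the first domain a typosquat of the second: {candidate} {target}"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer = tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True)
    # Assumes the model answers "true"/"false", as in the Direct Usage example.
    return answer.strip().lower() == "true", target

print(is_typosquat("goog1e.com"))
```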
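### Fine-tuning setup

For reference, a fine-tuning run matching the hyperparameters listed above could look like the following minimal sketch. The dataset column names (`input_text`, `target_text`), the output directory, and the presence of a validation split are assumptions; check the dataset card for the actual schema.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
dataset = load_dataset("Anvilogic/T5-Typosquat-Training-Dataset")

def preprocess(batch):
    # Column names "input_text" / "target_text" are assumed, not confirmed by the card.
    inputs = tokenizer(batch["input_text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target_text"], max_length=8, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-typosquat",  # illustrative path
    per_device_train_batch_size=8,   # batch size from the card
    num_train_epochs=5,              # epochs from the card
    learning_rate=5e-5,              # learning rate from the card
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),  # assumes a validation split exists
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```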