This is a lightweight version of BERT, specifically fine-tuned for classifying URLs into four categories: benign, phishing, malware, and defacement.
The model was evaluated on a test set. Per-class and aggregate classification metrics are listed below; the Accuracy row is a single overall value repeated in each column:
| Class | Precision | Recall | F1-Score | 
|---|---|---|---|
| Benign | 0.987695 | 0.993717 | 0.990697 | 
| Defacement | 0.988510 | 0.998963 | 0.993709 | 
| Malware | 0.988291 | 0.960332 | 0.974111 | 
| Phishing | 0.958425 | 0.930826 | 0.944423 | 
| Accuracy | 0.983738 | 0.983738 | 0.983738 | 
| Macro Avg | 0.980730 | 0.970959 | 0.975735 | 
| Weighted Avg | 0.983615 | 0.983738 | 0.983627 | 
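As a sanity check on the table, each F1-Score should equal the harmonic mean of its row's precision and recall, and the Macro Avg row should equal the unweighted mean of the four class rows. The sketch below recomputes both from the rounded table values. The names `metrics`, `f1_score`, and `macro_average` are illustrative and not part of the model card, and agreement is only expected up to rounding.

```python
# Per-class metrics copied from the table above: (precision, recall, f1).
metrics = {
    "Benign": (0.987695, 0.993717, 0.990697),
    "Defacement": (0.988510, 0.998963, 0.993709),
    "Malware": (0.988291, 0.960332, 0.974111),
    "Phishing": (0.958425, 0.930826, 0.944423),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(rows):
    """Unweighted mean of each metric column across the class rows."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Compare the reported F1 of each class with the recomputed value.
for name, (p, r, f1) in metrics.items():
    print(f"{name}: reported F1 {f1:.6f}, recomputed {f1_score(p, r):.6f}")

# Recompute the Macro Avg row (precision, recall, F1).
macro_p, macro_r, macro_f1 = macro_average(list(metrics.values()))
print(f"Macro Avg: {macro_p:.6f} {macro_r:.6f} {macro_f1:.6f}")
```

The Weighted Avg row cannot be reproduced the same way, because it depends on the number of test samples per class, which the table does not list.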
The following example classifies URLs with the Hugging Face transformers library:
from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
import torch
# Select the device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Load the model and tokenizer
model_name = "CrabInHoney/urlbert-tiny-v3-malicious-url-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.to(device)
# Create the classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
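    return_all_scores=True  # deprecated in recent transformers releases in favor of top_k=None, which returns a different output shape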
)
# Example URLs to test
test_urls = [
    "wikiobits.com/Obits/TonyProudfoot",
    "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
]
# Map model labels to human-readable class names
label_mapping = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing"
}
# Classify each URL and print the score of every class
for url in test_urls:
    results = classifier(url)
    print(f"\nURL: {url}")
    for result in results[0]:
        label = result['label']
        score = result['score']
        friendly_label = label_mapping.get(label, label)
        print(f"Class: {friendly_label}, probability: {score:.4f}")
Output:
URL: wikiobits.com/Obits/TonyProudfoot
Class: benign, probability: 0.9953
Class: defacement, probability: 0.0000
Class: malware, probability: 0.0000
Class: phishing, probability: 0.0046
URL: http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb
Class: benign, probability: 0.0000
Class: defacement, probability: 0.0001
Class: malware, probability: 0.9998
Class: phishing, probability: 0.0001
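The per-class scores shown above can be reduced to a single decision by taking the highest-scoring label. The following is a minimal sketch that operates on the list-of-dicts structure the loop iterates over (`results[0]`). The helper `top_prediction` and its `min_score` threshold of 0.5 are hypothetical additions, not part of the model card, and the sample scores are copied from the second example output rather than produced by running the model.

```python
# Same label mapping as in the usage example.
label_mapping = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing",
}


def top_prediction(scores, min_score=0.5):
    """Return (class_name, score) for the highest-scoring label.

    `scores` is a list of {"label": ..., "score": ...} dicts, one per class.
    If the best score is below `min_score` (an assumed, arbitrary cutoff),
    the class name is reported as "uncertain".
    """
    best = max(scores, key=lambda item: item["score"])
    if best["score"] < min_score:
        return "uncertain", best["score"]
    return label_mapping.get(best["label"], best["label"]), best["score"]


# Scores copied from the second example output (not a live model call).
sample_scores = [
    {"label": "LABEL_0", "score": 0.0000},
    {"label": "LABEL_1", "score": 0.0001},
    {"label": "LABEL_2", "score": 0.9998},
    {"label": "LABEL_3", "score": 0.0001},
]

print(top_prediction(sample_scores))  # ('malware', 0.9998)
```

With the pipeline configured as in the example, the helper would presumably be called as `top_prediction(classifier(url)[0])`. A suitable threshold depends on how costly false positives are in the target application.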
Base model: CrabInHoney/urlbert-tiny-base-v3