---
license: apache-2.0
language:
- en
pipeline_tag: fill-mask
tags:
- url
- cybersecurity
- urls
- links
- classification
- phishing-detection
- tiny
- phishing
- malware
- defacement
- transformers
- urlbert
- bert
- malicious
- base
---

urlbert-tiny-base-v4 is a lightweight BERT-based model optimized for URL analysis. This version includes several improvements over the previous one:

- Training with a teacher-student architecture
- Masked token prediction as the primary pre-training task
- Knowledge distillation from a larger model's logits
- Additional training on 3 specialized tasks to improve its understanding of URL structure

The result is an efficient model that can be fine-tuned quickly for URL classification tasks with minimal computational resources.
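
The card does not state the exact training loss, so the following is only a sketch of one common way to combine masked token prediction with distillation from a teacher's logits. It handles a single masked position with plain Python lists to stay dependency-free. The function names, the temperature of 2.0, and the weighting `alpha` are assumptions for illustration, not values taken from this model's training.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label loss and a soft-label loss (assumed form).

    alpha weights the masked-token cross-entropy against the true token;
    (1 - alpha) weights the KL divergence from teacher to student.
    """
    # Hard-label term: cross-entropy of the student against the masked token.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index])

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher_soft, student_soft)
        if t > 0
    )

    # The temperature**2 factor keeps the soft-term gradients on a scale
    # comparable to the hard term, as in the usual distillation setup.
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


# Toy vocabulary of three tokens; the true masked token is index 0.
loss = distillation_loss([2.0, 0.5, 0.1], [3.0, 0.2, 0.1], target_index=0)
print(f"combined loss: {loss:.4f}")
```

When student and teacher logits agree, the KL term is zero and only the masked-token cross-entropy remains.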

## Model Details

- **Parameters:** 3.72M
- **Tensor Type:** F32
- **Previous Version:** [urlbert-tiny-base-v3](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v3)
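
As a rough sanity check on these figures, the size of the weights is the parameter count times the bytes per parameter. The sketch below assumes exactly 3,720,000 parameters (the card gives a rounded 3.72M) and counts only the weights, not activations, optimizer state, or file-format overhead, so the result is approximate.

```python
PARAMETER_COUNT = 3_720_000  # rounded figure from the card; an approximation
BYTES_PER_F32 = 4            # a 32-bit float occupies 4 bytes


def weight_size_bytes(num_params, bytes_per_param=BYTES_PER_F32):
    """Return the approximate storage needed for the weights alone."""
    return num_params * bytes_per_param


size = weight_size_bytes(PARAMETER_COUNT)
print(f"{size / 1e6:.2f} MB")      # decimal megabytes
print(f"{size / 2**20:.2f} MiB")   # binary mebibytes
```

This works out to roughly 15 MB of weights.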

## Usage Example

```python
from transformers import BertTokenizerFast, BertForMaskedLM, pipeline
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

model_name = "CrabInHoney/urlbert-tiny-base-v4"

tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.to(device)

# device=0 selects the first GPU; -1 keeps the pipeline on the CPU.
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Each input contains a single [MASK] token for the model to predict.
sentences = [
    "http://example.[MASK]/"
]

for sentence in sentences:
    print(f"\nInput: {sentence}")
    results = fill_mask(sentence)
    # By default the pipeline returns the five most likely tokens.
    for result in results:
        token_str = result['token_str']
        score = result['score']
        print(f"Predicted token: {token_str}, probability: {score:.4f}")
```

### Sample Output

```
Input: http://example.[MASK]/
Predicted token: com, probability: 0.7307
Predicted token: net, probability: 0.1319
Predicted token: org, probability: 0.0881
Predicted token: info, probability: 0.0094
Predicted token: cn, probability: 0.0084
```
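
The introduction mentions fine-tuning for URL classification. The card gives no recipe for this, so the following is a minimal sketch of one way it might look. The two-class label set, the helper names `encode_labels` and `fine_tune`, and the hyperparameters are hypothetical, and the loop trains on a single in-memory batch on the CPU for brevity.

```python
# Hypothetical label set for illustration; the card does not define one.
LABEL_TO_ID = {"benign": 0, "malicious": 1}


def encode_labels(labels):
    """Map string labels to the integer ids expected by the model."""
    return [LABEL_TO_ID[label] for label in labels]


def fine_tune(urls, labels, model_name="CrabInHoney/urlbert-tiny-base-v4",
              epochs=3, learning_rate=5e-5):
    """Fine-tune the checkpoint as a sequence classifier (assumed setup)."""
    # Imported here so encode_labels stays usable without these packages.
    import torch
    from transformers import BertForSequenceClassification, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    # The classification head is newly initialized; only the encoder
    # weights come from the pre-trained checkpoint.
    model = BertForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABEL_TO_ID)
    )

    batch = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")
    targets = torch.tensor(encode_labels(labels))

    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = model(**batch, labels=targets).loss
        loss.backward()
        optimizer.step()
    return tokenizer, model
```

Calling `fine_tune(urls, labels)` with a list of URL strings and a matching list of label names downloads the checkpoint and returns the tokenizer and the trained model. A real run would add a validation split, mini-batching over a dataset, and GPU placement.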