---
license: mit
library_name: sklearn
tags:
  - text-classification
  - sklearn
  - phishing
  - url
  - onnx
model_format: pickle
model_file: model.pkl
inference: false
pipeline_tag: text-classification
datasets:
  - pirocheto/phishing-url
---

# Model Description

The model predicts the probability that a URL is a phishing site.  
To understand what phishing is, refer to the Wikipedia page:  
[https://en.wikipedia.org/wiki/Phishing](https://en.wikipedia.org/wiki/Phishing) 
-- this is not a phishing link 😜

- **Model type:** LinearSVM
- **Task:** Binary classification
- **License:** MIT
- **Repository:** https://github.com/pirocheto/phishing-url-detection

## Evaluation

| Metric    |    Value |
|-----------|----------|
| roc_auc   | 0.986844 |
| accuracy  | 0.948568 |
| f1        | 0.948623 |
| precision | 0.947619 |
| recall    | 0.949629 |
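
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The snippet below is only an arithmetic sketch using the numbers from the table; it does not rerun the evaluation.

```python
# Values copied from the evaluation table above.
precision = 0.947619
recall = 0.949629

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Recomputed F1: {f1:.6f}")  # agrees with the reported 0.948623
```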

# How to Get Started with the Model

Loading a pickle file in Python is discouraged because deserialization can execute arbitrary code embedded in the file.
Pickle files are also not portable across Python versions and cannot be read from other languages.
Read more about this subject in the [Hugging Face Documentation](https://huggingface.co/docs/hub/security-pickle).
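
To make the risk concrete, here is a minimal sketch of why unpickling untrusted data is unsafe. The `Payload` class is hypothetical and not part of this model; it uses a harmless `eval("6 * 7")` as a stand-in for whatever an attacker might choose to run.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle form runs code when loaded."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object itself.
        # An attacker could put any callable and arguments here.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading the bytes calls eval(); no Payload instance is ever created.
result = pickle.loads(data)
print(result)  # 42
```

The same mechanism applies to `joblib.load`, which relies on pickle under the hood, so only load `model.pkl` from a source you trust.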

Instead, we recommend using the ONNX model, which is more secure.
In addition to being lighter and faster, it can be utilized by languages supported by the [ONNX runtime](https://onnxruntime.ai/docs/get-started/).

Below are some examples to get you started. For other languages, please refer to the ONNX documentation.

<details>
  <summary><b>Python</b> - ONNX - [recommended 👍]</summary>

```python
import numpy as np
import onnxruntime
from huggingface_hub import hf_hub_download

REPO_ID = "pirocheto/phishing-url-detection"
FILENAME = "model.onnx"
model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

# Initializing the ONNX Runtime session with the pre-trained model
sess = onnxruntime.InferenceSession(
    model_path,
    providers=["CPUExecutionProvider"],
)

urls = [
    "https://clubedemilhagem.com/home.php",
    "http://www.medicalnewstoday.com/articles/188939.php",
]
inputs = np.array(urls, dtype="str")

# Using the ONNX model to make predictions on the input data
results = sess.run(None, {"inputs": inputs})[1]

for url, proba in zip(urls, results):
    print(f"URL: {url}")
    print(f"Likelihood of being a phishing site: {proba[1] * 100:.2f} %")
    print("----")

```
</details>

<details>
  <summary><b>Node.js</b> - ONNX - [recommended 👍]</summary>

```javascript
const ort = require('onnxruntime-node');

async function main() {
    
    try {
        // Make sure you have downloaded the model.onnx
        // Creating an ONNX inference session with the specified model
        const model_path = "./model.onnx";
        const session = await ort.InferenceSession.create(model_path);

        const urls = [
            "https://clubedemilhagem.com/home.php",
            "http://www.medicalnewstoday.com/articles/188939.php",
        ]
        
        // Creating an ONNX tensor from the input data
        const tensor = new ort.Tensor('string', urls, [urls.length,]);
        
        // Executing the inference session with the input tensor
        const results = await session.run({"inputs": tensor});
        const probas = results['probabilities'].data;
        
        // Displaying results for each URL
        urls.forEach((url, index) => {
            const proba = probas[index * 2 + 1];
            const percent = (proba * 100).toFixed(2);
            
            console.log(`URL: ${url}`);
            console.log(`Likelihood of being a phishing site: ${percent}%`);
            console.log("----");
        });

    } catch (e) {
        console.log(`Failed to run inference on the ONNX model: ${e}.`);
    }
}

main();
```
</details>

<details>
  <summary><b>JavaScript</b> - ONNX - [recommended 👍]</summary>

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Get Started with JavaScript</title>
  </head>
  <body>
    <!-- import ONNXRuntime Web from CDN -->
    <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
    <script>
      // use an async context to call onnxruntime functions.
      async function main() {
        try {
          const model_path = "./model.onnx";
          const session = await ort.InferenceSession.create(model_path);

          const urls = [
            "https://clubedemilhagem.com/home.php",
            "http://www.medicalnewstoday.com/articles/188939.php",
          ];

          // Creating an ONNX tensor from the input data
          const tensor = new ort.Tensor("string", urls, [urls.length]);

          // Executing the inference session with the input tensor
          const results = await session.run({ inputs: tensor });
          const probas = results["probabilities"].data;

          // Displaying results for each URL
          urls.forEach((url, index) => {
            const proba = probas[index * 2 + 1];
            const percent = (proba * 100).toFixed(2);

            document.write(`URL: ${url} <br>`);
            document.write(
              `Likelihood of being a phishing site: ${percent} % <br>`
            );
            document.write("---- <br>");
          });
        } catch (e) {
          document.write(`Failed to run inference on the ONNX model: ${e}.`);
        }
      }
      main();
    </script>
  </body>
</html>
```
</details>
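
The two JavaScript examples above read `probas[index * 2 + 1]`. This suggests the `probabilities` output arrives as a flat array with two values per URL, where the second value is the phishing probability. The sketch below illustrates that indexing in plain Python; the helper name `phishing_probabilities` and the sample numbers are made up for illustration.

```python
def phishing_probabilities(flat, n_classes=2, positive_class=1):
    """Pick the phishing-class value for each URL from a flat output array.

    Assumes a row-major layout: [url0_class0, url0_class1, url1_class0, ...].
    """
    if len(flat) % n_classes != 0:
        raise ValueError("flat array length must be a multiple of n_classes")
    n_rows = len(flat) // n_classes
    return [flat[i * n_classes + positive_class] for i in range(n_rows)]


# Made-up values for two URLs, not real model output.
flat_output = [0.9, 0.1, 0.2, 0.8]
print(phishing_probabilities(flat_output))  # [0.1, 0.8]
```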

<details>
  <summary><b>Python</b> - Pickle - [not recommended ⚠️]</summary>

```python
import joblib
from huggingface_hub import hf_hub_download

REPO_ID = "pirocheto/phishing-url-detection"
FILENAME = "model.pkl"

# Download the model from the Hugging Face Model Hub
model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

urls = [
    "https://clubedemilhagem.com/home.php",
    "http://www.medicalnewstoday.com/articles/188939.php",
]

# Load the downloaded model using joblib
model = joblib.load(model_path)

# Predict probabilities for each URL
probas = model.predict_proba(urls)

for url, proba in zip(urls, probas):
    print(f"URL: {url}")
    print(f"Likelihood of being a phishing site: {proba[1] * 100:.2f} %")
    print("----")

```
</details>