Update README.md
README.md
CHANGED
@@ -14,7 +14,7 @@ library_name: fasttext
|
|
14 |
[Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset)
|
15 |
|
16 |
This classifier classifies a text into Code or NaturalLanguage.
|
17 |
-
The model was trained on 3.24M records, a mix of code and natural language, and achieved a test F1 score of 0.
|
18 |
The classifier can be used for LLM pretraining data curation, to route text into different pipelines (e.g. a code syntax check).
|
19 |
It is ultra fast ⚡ with a throughput of ~2000 doc/s on CPU.
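The routing use case mentioned above could be sketched roughly as follows. This assumes predictions shaped like the `{'label': ..., 'score': ...}` dictionaries shown in the README's usage example; the `route` function, the bucket names, and the 0.9 threshold are hypothetical choices for illustration, not part of the model or its API.

```python
from typing import Dict, List


def route(docs: List[str], preds: List[Dict], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Group documents by predicted label (hypothetical helper).

    Documents whose score falls below `threshold` go to a 'review'
    bucket instead of being trusted to either pipeline.
    """
    buckets: Dict[str, List[str]] = {"code": [], "natural_language": [], "review": []}
    for doc, pred in zip(docs, preds):
        if pred["score"] < threshold:
            buckets["review"].append(doc)
        elif pred["label"] == "Code":
            buckets["code"].append(doc)
        else:
            buckets["natural_language"].append(doc)
    return buckets
```

Documents in the `code` bucket could then be passed to a syntax check, while the `review` bucket keeps low-confidence predictions out of both pipelines.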
|
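The ~2000 doc/s figure will likely vary with hardware and document length, so it may be worth measuring locally. A minimal sketch, assuming `predict_fn` is any callable that takes a list of strings (for example the README's `predict` function); the helper name `measure_throughput` is hypothetical.

```python
import time
from typing import Callable, List


def measure_throughput(predict_fn: Callable[[List[str]], object], docs: List[str]) -> float:
    """Return documents processed per second for one batch call."""
    start = time.perf_counter()
    predict_fn(docs)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast calls.
    return len(docs) / max(elapsed, 1e-9)
```

Running it on a sample of a few thousand representative documents should give a steadier estimate than a handful of short strings.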
20 |
|
@@ -50,6 +50,17 @@ predict([
|
|
50 |
# {'label': 'Code', 'score': 1.00001},
|
51 |
# {'label': 'Code', 'score': 1.000009}]
|
52 |
```
|
53 |
|
54 |
|
55 |
## 📝Definition of Label
|
|
|
14 |
[Dataset](https://huggingface.co/datasets/kenhktsui/code-natural-language-classification-dataset)
|
15 |
|
16 |
This classifier classifies a text into Code or NaturalLanguage.
|
17 |
+
The model was trained on 3.24M records, a mix of code and natural language, and achieved a test F1 score of 0.97.
|
18 |
The classifier can be used for LLM pretraining data curation, to route text into different pipelines (e.g. a code syntax check).
|
19 |
It is ultra fast ⚡ with a throughput of ~2000 doc/s on CPU.
|
20 |
|
|
|
50 |
# {'label': 'Code', 'score': 1.00001},
|
51 |
# {'label': 'Code', 'score': 1.000009}]
|
52 |
```
|
53 |
+
## 📊Evaluation
|
54 |
+
```
|
55 |
+
                 precision    recall  f1-score   support
|
56 |
+
|
57 |
+
           Code       0.97      1.00      0.98    581282
|
58 |
+
NaturalLanguage       1.00      0.92      0.95    228993
|
59 |
+
|
60 |
+
       accuracy                           0.98    810275
|
61 |
+
      macro avg       0.98      0.96      0.97    810275
|
62 |
+
   weighted avg       0.98      0.98      0.98    810275
|
63 |
+
```
|
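As a sanity check on the report above, the sketch below recomputes F1 from precision and recall and derives the averaged rows from the per-class values. The headline test F1 of 0.97 appears to correspond to the macro-average row, though the README does not say so explicitly. Because the report is rounded to two decimals, recomputed values can differ from the printed ones by about 0.01. The `report` dictionary and function names are illustrative only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, f1, support), copied from the rounded report.
report = {
    "Code": (0.97, 1.00, 0.98, 581282),
    "NaturalLanguage": (1.00, 0.92, 0.95, 228993),
}

total_support = sum(row[3] for row in report.values())

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: per-class F1 weighted by class support.
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total_support

for label, (p, r, reported, _) in report.items():
    print(f"{label}: recomputed F1 {f1(p, r):.3f}, reported {reported:.2f}")
print(f"macro F1 {macro_f1:.3f}, weighted F1 {weighted_f1:.3f}, support {total_support}")
```

The gap between the two classes comes mainly from NaturalLanguage recall (0.92): some natural-language texts are labeled as Code, while Code recall is close to 1.00.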
64 |
|
65 |
|
66 |
## 📝Definition of Label
|