---
datasets:
- kenhktsui/FineFineWeb-First100K
tags:
- fasttext
language:
- en
metrics:
- f1
pipeline_tag: text-classification
---
# finefineweb-domain-fasttext-classifier

This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.
This classifier classifies a text into one of the domains specified in [m-a-p/FineFineWeb](https://huggingface.co/datasets/m-a-p/FineFineWeb).
It can be used for LLM pretraining data curation, to enhance model capability across many domains.
It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.

Don't underestimate the "old" fasttext classifier! It remains a good and scalable practice.
For example, [QWEN2.5-MATH](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate its pretraining data, although its classifier is not open sourced.

## 🛠️Usage
```python
import re
from typing import List

import fasttext
from huggingface_hub import hf_hub_download


model_hf = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    return [{"label": l[0][9:], "score": s[0]}  # strip the "__label__" prefix
            for l, s in zip(*pred)]


predict(
    [
        "Arsenal is the best team in the world",
        "Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.",
        "Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.",
        "Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs."
    ]
)

# [{'label': 'sports', 'score': 0.5640762},
#  {'label': 'economics', 'score': 0.53133816},
#  {'label': 'physics', 'score': 0.9524484},
#  {'label': 'computer_science_and_technology', 'score': 0.41515663}]
```
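In a curation pipeline, a common pattern is to keep only documents whose predicted domain falls in a target set and whose score clears a threshold. Below is a minimal sketch of that step; the `filter_by_domain` helper and the 0.6 threshold are illustrative choices, not part of this repo:

```python
from typing import Dict, List, Set


def filter_by_domain(
    docs: List[str],
    preds: List[Dict],       # output of predict(docs) above
    target_domains: Set[str],
    min_score: float = 0.6,  # illustrative threshold; tune per domain
) -> List[str]:
    """Keep documents predicted to fall in a target domain with enough confidence."""
    return [
        doc
        for doc, pred in zip(docs, preds)
        if pred["label"] in target_domains and pred["score"] >= min_score
    ]


# Hard-coded predictions mirroring the example output above:
docs = ["sports doc", "economics doc", "physics doc"]
preds = [
    {"label": "sports", "score": 0.56},
    {"label": "economics", "score": 0.53},
    {"label": "physics", "score": 0.95},
]
kept = filter_by_domain(docs, preds, {"physics", "mathematics"})
# kept == ["physics doc"]
```

Since per-domain F1 varies widely (see the evaluation below), a single global threshold is unlikely to be optimal; calibrating a threshold per target domain is a reasonable refinement.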
## 📊Evaluation
full version
```
                                       precision    recall  f1-score   support

                            aerospace       0.69      0.72      0.71     10000
                             agronomy       0.68      0.74      0.71     10000
                             artistic       0.37      0.24      0.29     10000
                            astronomy       0.67      0.76      0.71     10000
                  atmospheric_science       0.82      0.92      0.87     10000
                           automotive       0.66      0.74      0.70     10000
                               beauty       0.82      0.86      0.84     10000
                              biology       0.44      0.45      0.45     10000
                            celebrity       0.69      0.81      0.75     10000
                            chemistry       0.51      0.49      0.50     10000
                         christianity       0.80      0.84      0.82     10000
                    civil_engineering       0.58      0.58      0.58     10000
            communication_engineering       0.63      0.67      0.65     10000
      computer_science_and_technology       0.63      0.59      0.61     10000
                               design       0.51      0.42      0.46     10000
                       drama_and_film       0.53      0.53      0.53     10000
                            economics       0.34      0.26      0.29     10000
                   electronic_science       0.42      0.35      0.38     10000
                        entertainment       0.43      0.29      0.34     10000
                environmental_science       0.42      0.35      0.38     10000
                              fashion       0.72      0.77      0.74     10000
                              finance       0.49      0.52      0.50     10000
                                 food       0.81      0.86      0.83     10000
                               gamble       0.78      0.93      0.85     10000
                                 game       0.67      0.67      0.67     10000
                            geography       0.42      0.33      0.37     10000
                               health       0.43      0.29      0.34     10000
                              history       0.64      0.71      0.67     10000
                                hobby       0.45      0.37      0.41     10000
                hydraulic_engineering       0.95      0.98      0.96     10000
                   instrument_science       0.48      0.50      0.49     10000
   journalism_and_media_communication       0.26      0.11      0.16     10000
               landscape_architecture       0.78      0.83      0.80     10000
                                  law       0.50      0.55      0.53     10000
                              library       0.53      0.51      0.52     10000
                           literature       0.52      0.53      0.52     10000
                    materials_science       0.49      0.50      0.50     10000
                          mathematics       0.87      0.90      0.88     10000
               mechanical_engineering       0.48      0.37      0.42     10000
                              medical       0.41      0.42      0.41     10000
                   mining_engineering       0.84      0.93      0.89     10000
                                movie       0.59      0.71      0.64     10000
                      music_and_dance       0.75      0.86      0.80     10000
                                 news       0.23      0.13      0.16     10000
                      nuclear_science       0.92      0.96      0.94     10000
                        ocean_science       0.83      0.92      0.88     10000
                  optical_engineering       0.70      0.78      0.74     10000
                             painting       0.91      0.96      0.94     10000
                                  pet       0.91      0.95      0.93     10000
petroleum_and_natural_gas_engineering       0.92      0.96      0.94     10000
                           philosophy       0.63      0.66      0.64     10000
                                photo       0.80      0.85      0.82     10000
                              physics       0.40      0.35      0.37     10000
                             politics       0.38      0.41      0.39     10000
                           psychology       0.62      0.66      0.64     10000
                public_administration       0.35      0.33      0.34     10000
                         relationship       0.84      0.88      0.86     10000
                            sociology       0.46      0.50      0.48     10000
                               sports       0.66      0.82      0.73     10000
                           statistics       0.60      0.70      0.65     10000
                      systems_science       0.53      0.53      0.53     10000
                      textile_science       0.81      0.86      0.83     10000
                           topicality       0.97      0.99      0.98     10000
           transportation_engineering       0.51      0.52      0.51     10000
                               travel       0.68      0.72      0.70     10000
                       urban_planning       0.56      0.62      0.59     10000
                      weapons_science       0.97      0.99      0.98     10000

                             accuracy                           0.64    670000
                            macro avg       0.62      0.64      0.63    670000
                         weighted avg       0.62      0.64      0.63    670000
```

## ⚠️Known Limitation
The classifier does not handle short text well, which is perhaps unsurprising: a few tokens rarely carry enough signal to identify a domain.
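One simple workaround when curating data is to only trust predictions on documents above a minimum length. A minimal sketch; the `long_enough` helper and its 200-character cutoff are arbitrary illustrations, not tuned values:

```python
def long_enough(text: str, min_chars: int = 200) -> bool:
    """Guard against unreliable predictions on very short documents.

    The 200-character cutoff is an arbitrary illustration, not a tuned value.
    """
    return len(text.strip()) >= min_chars


docs = ["Arsenal!", "Macroeconomics is the study of an economy as a whole. " * 5]
to_classify = [d for d in docs if long_enough(d)]
# only the longer macroeconomics document is passed to the classifier
```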