kenhktsui committed c4a01ff (verified) · 1 Parent(s): 48fb42d

Create README.md

Files changed (1): README.md (+141 -0)

---
datasets:
- kenhktsui/FineFineWeb-First100K
tags:
- fasttext
language:
- en
metrics:
- f1
pipeline_tag: text-classification
---
# finefineweb-domain-fasttext-classifier

This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.
This classifier assigns a text to one of the domains defined in [m-a-p/FineFineWeb](https://huggingface.co/datasets/m-a-p/FineFineWeb).
It can be used for LLM pretraining data curation to enhance capability across many domains.
It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.

Don't underestimate the "old" fasttext classifier! It remains a good and scalable practice.
For example, [Qwen2.5-Math](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate its pretraining data, although its classifier is not open-sourced.

## 🛠️Usage
```python
from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


# Download the classifier from the Hub and load it with fasttext.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fasttext expects one document per line, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # Labels come back as "__label__<domain>"; strip the 9-character prefix.
    return [{"label": l[0][9:], "score": s[0]}
            for l, s in zip(*pred)]


predict(
    [
        "Arsenal is the best team in the world",
        "Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.",
        "Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.",
        "Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs."
    ]
)

# [{'label': 'sports', 'score': 0.5640762},
#  {'label': 'economics', 'score': 0.53133816},
#  {'label': 'physics', 'score': 0.9524484},
#  {'label': 'computer_science_and_technology', 'score': 0.41515663}]
```
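
For pretraining data curation, the typical pattern is to keep only documents whose predicted domain falls in a target set with sufficient confidence. Below is a minimal sketch reusing the `predict` helper above; the target domains and the 0.5 threshold are illustrative assumptions, not recommendations shipped with the model.
```python
# Minimal curation sketch: keep documents classified into target domains with
# sufficient confidence. TARGET_DOMAINS and SCORE_THRESHOLD are illustrative
# assumptions; tune them on a held-out sample of your own corpus.
TARGET_DOMAINS = {"mathematics", "physics", "computer_science_and_technology"}
SCORE_THRESHOLD = 0.5


def filter_documents(documents: List[str]) -> List[str]:
    predictions = predict(documents)
    return [
        doc
        for doc, p in zip(documents, predictions)
        if p["label"] in TARGET_DOMAINS and p["score"] >= SCORE_THRESHOLD
    ]
```
In practice, such a filter would be streamed over a web-scale corpus, with the threshold tuned per domain against a held-out sample.
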
## 📊Evaluation
full version
```
                                      precision    recall  f1-score   support

                            aerospace       0.69      0.72      0.71     10000
                             agronomy       0.68      0.74      0.71     10000
                             artistic       0.37      0.24      0.29     10000
                            astronomy       0.67      0.76      0.71     10000
                  atmospheric_science       0.82      0.92      0.87     10000
                           automotive       0.66      0.74      0.70     10000
                               beauty       0.82      0.86      0.84     10000
                              biology       0.44      0.45      0.45     10000
                            celebrity       0.69      0.81      0.75     10000
                            chemistry       0.51      0.49      0.50     10000
                         christianity       0.80      0.84      0.82     10000
                    civil_engineering       0.58      0.58      0.58     10000
            communication_engineering       0.63      0.67      0.65     10000
      computer_science_and_technology       0.63      0.59      0.61     10000
                               design       0.51      0.42      0.46     10000
                       drama_and_film       0.53      0.53      0.53     10000
                            economics       0.34      0.26      0.29     10000
                   electronic_science       0.42      0.35      0.38     10000
                        entertainment       0.43      0.29      0.34     10000
                environmental_science       0.42      0.35      0.38     10000
                              fashion       0.72      0.77      0.74     10000
                              finance       0.49      0.52      0.50     10000
                                 food       0.81      0.86      0.83     10000
                               gamble       0.78      0.93      0.85     10000
                                 game       0.67      0.67      0.67     10000
                            geography       0.42      0.33      0.37     10000
                               health       0.43      0.29      0.34     10000
                              history       0.64      0.71      0.67     10000
                                hobby       0.45      0.37      0.41     10000
                hydraulic_engineering       0.95      0.98      0.96     10000
                   instrument_science       0.48      0.50      0.49     10000
   journalism_and_media_communication       0.26      0.11      0.16     10000
               landscape_architecture       0.78      0.83      0.80     10000
                                  law       0.50      0.55      0.53     10000
                              library       0.53      0.51      0.52     10000
                           literature       0.52      0.53      0.52     10000
                    materials_science       0.49      0.50      0.50     10000
                          mathematics       0.87      0.90      0.88     10000
               mechanical_engineering       0.48      0.37      0.42     10000
                              medical       0.41      0.42      0.41     10000
                   mining_engineering       0.84      0.93      0.89     10000
                                movie       0.59      0.71      0.64     10000
                      music_and_dance       0.75      0.86      0.80     10000
                                 news       0.23      0.13      0.16     10000
                      nuclear_science       0.92      0.96      0.94     10000
                        ocean_science       0.83      0.92      0.88     10000
                  optical_engineering       0.70      0.78      0.74     10000
                             painting       0.91      0.96      0.94     10000
                                  pet       0.91      0.95      0.93     10000
petroleum_and_natural_gas_engineering       0.92      0.96      0.94     10000
                           philosophy       0.63      0.66      0.64     10000
                                photo       0.80      0.85      0.82     10000
                              physics       0.40      0.35      0.37     10000
                             politics       0.38      0.41      0.39     10000
                           psychology       0.62      0.66      0.64     10000
                public_administration       0.35      0.33      0.34     10000
                         relationship       0.84      0.88      0.86     10000
                            sociology       0.46      0.50      0.48     10000
                               sports       0.66      0.82      0.73     10000
                           statistics       0.60      0.70      0.65     10000
                      systems_science       0.53      0.53      0.53     10000
                      textile_science       0.81      0.86      0.83     10000
                           topicality       0.97      0.99      0.98     10000
           transportation_engineering       0.51      0.52      0.51     10000
                               travel       0.68      0.72      0.70     10000
                       urban_planning       0.56      0.62      0.59     10000
                      weapons_science       0.97      0.99      0.98     10000

                             accuracy                           0.64    670000
                            macro avg       0.62      0.64      0.63    670000
                         weighted avg       0.62      0.64      0.63    670000

```
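
The report above follows scikit-learn's `classification_report` format. Below is a sketch of how a similar report can be produced from the classifier's predictions on a labeled held-out set; the dataset, split name, and column names (`text`, `label`) are assumptions for illustration, not the author's documented evaluation setup.
```python
# Sketch of producing a classification report with scikit-learn.
# The dataset, split, and column names below are assumptions for illustration.
from datasets import load_dataset
from sklearn.metrics import classification_report

eval_ds = load_dataset("kenhktsui/FineFineWeb-First100K", split="test")
y_true = eval_ds["label"]
y_pred = [p["label"] for p in predict(eval_ds["text"])]
print(classification_report(y_true, y_pred))
```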

## ⚠️Known Limitation
The classifier does not handle short text well, which is not surprising: very short documents carry little signal for domain classification.
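
A simple mitigation during curation is to only trust predictions on documents above a minimum length. Below is a minimal sketch reusing the `predict` helper from the usage section; the 50-word cutoff is an illustrative assumption, not a validated threshold.
```python
MIN_WORDS = 50  # illustrative cutoff, not a validated threshold


def predict_if_long_enough(text_list: List[str]) -> List[dict]:
    # Skip documents that are too short to classify reliably.
    results = []
    for text in text_list:
        if len(text.split()) < MIN_WORDS:
            results.append({"label": None, "score": 0.0})
        else:
            results.append(predict([text])[0])
    return results
```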