---
datasets:
- kenhktsui/FineFineWeb-First100K
tags:
- fasttext
language:
- en
metrics:
- f1
pipeline_tag: text-classification
---
# finefineweb-domain-fasttext-classifier

This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.
This classifier classifies a text into one of the domains specified in [m-a-p/FineFineWeb](https://huggingface.co/datasets/m-a-p/FineFineWeb).
It can be used for LLM pretraining data curation, to enhance model capability across many domains.
It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.

Don't underestimate the "old" fasttext classifier! It remains a good and scalable practice.
For example, [QWEN2.5-MATH](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate its pretraining data, although its classifier is not open sourced.

## 🛠️Usage
```python
import re
from typing import List

import fasttext
from huggingface_hub import hf_hub_download


model_hf = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    return [{"label": l[0][9:], "score": s[0]}  # strip the "__label__" prefix
            for l, s in zip(*pred)]


predict(
    [
        "Arsenal is the best team in the world",
        "Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.",
        "Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.",
        "Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs."
    ]
)

# [{'label': 'sports', 'score': 0.5640762},
#  {'label': 'economics', 'score': 0.53133816},
#  {'label': 'physics', 'score': 0.9524484},
#  {'label': 'computer_science_and_technology', 'score': 0.41515663}]
```
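In a curation pipeline, a common pattern is to keep only documents whose predicted domain falls in a target set and whose score clears a threshold. Below is a minimal sketch of that step; the `filter_by_domain` helper and the 0.6 threshold are illustrative choices, not part of this repo:

```python
from typing import Dict, List, Set


def filter_by_domain(
    docs: List[str],
    preds: List[Dict],       # output of predict(docs) above
    target_domains: Set[str],
    min_score: float = 0.6,  # illustrative threshold; tune per domain
) -> List[str]:
    """Keep documents predicted to fall in a target domain with enough confidence."""
    return [
        doc
        for doc, pred in zip(docs, preds)
        if pred["label"] in target_domains and pred["score"] >= min_score
    ]


# Hard-coded predictions mirroring the example output above:
docs = ["sports doc", "economics doc", "physics doc"]
preds = [
    {"label": "sports", "score": 0.56},
    {"label": "economics", "score": 0.53},
    {"label": "physics", "score": 0.95},
]
kept = filter_by_domain(docs, preds, {"physics", "mathematics"})
# kept == ["physics doc"]
```

Since per-domain F1 varies widely (see the evaluation below), a single global threshold is unlikely to be optimal; calibrating a threshold per target domain is a reasonable refinement.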
## 📊Evaluation
full version
```
                                       precision    recall  f1-score   support

                            aerospace       0.69      0.72      0.71     10000
                             agronomy       0.68      0.74      0.71     10000
                             artistic       0.37      0.24      0.29     10000
                            astronomy       0.67      0.76      0.71     10000
                  atmospheric_science       0.82      0.92      0.87     10000
                           automotive       0.66      0.74      0.70     10000
                               beauty       0.82      0.86      0.84     10000
                              biology       0.44      0.45      0.45     10000
                            celebrity       0.69      0.81      0.75     10000
                            chemistry       0.51      0.49      0.50     10000
                         christianity       0.80      0.84      0.82     10000
                    civil_engineering       0.58      0.58      0.58     10000
            communication_engineering       0.63      0.67      0.65     10000
      computer_science_and_technology       0.63      0.59      0.61     10000
                               design       0.51      0.42      0.46     10000
                       drama_and_film       0.53      0.53      0.53     10000
                            economics       0.34      0.26      0.29     10000
                   electronic_science       0.42      0.35      0.38     10000
                        entertainment       0.43      0.29      0.34     10000
                environmental_science       0.42      0.35      0.38     10000
                              fashion       0.72      0.77      0.74     10000
                              finance       0.49      0.52      0.50     10000
                                 food       0.81      0.86      0.83     10000
                               gamble       0.78      0.93      0.85     10000
                                 game       0.67      0.67      0.67     10000
                            geography       0.42      0.33      0.37     10000
                               health       0.43      0.29      0.34     10000
                              history       0.64      0.71      0.67     10000
                                hobby       0.45      0.37      0.41     10000
                hydraulic_engineering       0.95      0.98      0.96     10000
                   instrument_science       0.48      0.50      0.49     10000
   journalism_and_media_communication       0.26      0.11      0.16     10000
               landscape_architecture       0.78      0.83      0.80     10000
                                  law       0.50      0.55      0.53     10000
                              library       0.53      0.51      0.52     10000
                           literature       0.52      0.53      0.52     10000
                    materials_science       0.49      0.50      0.50     10000
                          mathematics       0.87      0.90      0.88     10000
               mechanical_engineering       0.48      0.37      0.42     10000
                              medical       0.41      0.42      0.41     10000
                   mining_engineering       0.84      0.93      0.89     10000
                                movie       0.59      0.71      0.64     10000
                      music_and_dance       0.75      0.86      0.80     10000
                                 news       0.23      0.13      0.16     10000
                      nuclear_science       0.92      0.96      0.94     10000
                        ocean_science       0.83      0.92      0.88     10000
                  optical_engineering       0.70      0.78      0.74     10000
                             painting       0.91      0.96      0.94     10000
                                  pet       0.91      0.95      0.93     10000
petroleum_and_natural_gas_engineering       0.92      0.96      0.94     10000
                           philosophy       0.63      0.66      0.64     10000
                                photo       0.80      0.85      0.82     10000
                              physics       0.40      0.35      0.37     10000
                             politics       0.38      0.41      0.39     10000
                           psychology       0.62      0.66      0.64     10000
                public_administration       0.35      0.33      0.34     10000
                         relationship       0.84      0.88      0.86     10000
                            sociology       0.46      0.50      0.48     10000
                               sports       0.66      0.82      0.73     10000
                           statistics       0.60      0.70      0.65     10000
                      systems_science       0.53      0.53      0.53     10000
                      textile_science       0.81      0.86      0.83     10000
                           topicality       0.97      0.99      0.98     10000
           transportation_engineering       0.51      0.52      0.51     10000
                               travel       0.68      0.72      0.70     10000
                       urban_planning       0.56      0.62      0.59     10000
                      weapons_science       0.97      0.99      0.98     10000

                             accuracy                           0.64    670000
                            macro avg       0.62      0.64      0.63    670000
                         weighted avg       0.62      0.64      0.63    670000
```

## ⚠️Known Limitation
The classifier does not handle short text well, which is perhaps unsurprising: a few tokens rarely carry enough signal to identify a domain.
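One simple workaround when curating data is to only trust predictions on documents above a minimum length. A minimal sketch; the `long_enough` helper and its 200-character cutoff are arbitrary illustrations, not tuned values:

```python
def long_enough(text: str, min_chars: int = 200) -> bool:
    """Guard against unreliable predictions on very short documents.

    The 200-character cutoff is an arbitrary illustration, not a tuned value.
    """
    return len(text.strip()) >= min_chars


docs = ["Arsenal!", "Macroeconomics is the study of an economy as a whole. " * 5]
to_classify = [d for d in docs if long_enough(d)]
# only the longer macroeconomics document is passed to the classifier
```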