---
datasets:
- kenhktsui/FineFineWeb-First100K
tags:
- fasttext
language:
- en
metrics:
- f1
pipeline_tag: text-classification
---

# finefineweb-domain-fasttext-classifier

This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.

This classifier classifies a text into the domains specified in [m-a-p/FineFineWeb](https://huggingface.co/datasets/m-a-p/FineFineWeb).
The classifier can be used for LLM pretraining data curation, to enhance capability in many domains.
It is ultra fast ⚡ with a throughput of ~2000 doc/s on CPU.

Don't underestimate the "old" fasttext classifier! It is indeed a good and scalable practice.
For example, [QWEN2.5-MATH](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate pretraining data, although its classifier is not open-sourced.

## 🛠️Usage

```python
from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


# Download the model file from the Hugging Face Hub and load it.
model = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fastText's predict() rejects input containing newlines, so collapse them into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list)
    # Strip the 9-character "__label__" prefix from each top label.
    return [{"label": l[0][9:], "score": s[0]} for l, s in zip(*pred)]


predict(
    [
        "Arsenal is the best team in the world",
        "Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.",
        "Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.",
        "Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs."
    ]
)
# [{'label': 'sports', 'score': 0.5640762},
#  {'label': 'economics', 'score': 0.53133816},
#  {'label': 'physics', 'score': 0.9524484},
#  {'label': 'computer_science_and_technology', 'score': 0.41515663}]
```

## 📊Evaluation

full version
```
                                       precision    recall  f1-score   support

                            aerospace       0.69      0.72      0.71     10000
                             agronomy       0.68      0.74      0.71     10000
                             artistic       0.37      0.24      0.29     10000
                            astronomy       0.67      0.76      0.71     10000
                  atmospheric_science       0.82      0.92      0.87     10000
                           automotive       0.66      0.74      0.70     10000
                               beauty       0.82      0.86      0.84     10000
                              biology       0.44      0.45      0.45     10000
                            celebrity       0.69      0.81      0.75     10000
                            chemistry       0.51      0.49      0.50     10000
                         christianity       0.80      0.84      0.82     10000
                    civil_engineering       0.58      0.58      0.58     10000
            communication_engineering       0.63      0.67      0.65     10000
      computer_science_and_technology       0.63      0.59      0.61     10000
                               design       0.51      0.42      0.46     10000
                       drama_and_film       0.53      0.53      0.53     10000
                            economics       0.34      0.26      0.29     10000
                   electronic_science       0.42      0.35      0.38     10000
                        entertainment       0.43      0.29      0.34     10000
                environmental_science       0.42      0.35      0.38     10000
                              fashion       0.72      0.77      0.74     10000
                              finance       0.49      0.52      0.50     10000
                                 food       0.81      0.86      0.83     10000
                               gamble       0.78      0.93      0.85     10000
                                 game       0.67      0.67      0.67     10000
                            geography       0.42      0.33      0.37     10000
                               health       0.43      0.29      0.34     10000
                              history       0.64      0.71      0.67     10000
                                hobby       0.45      0.37      0.41     10000
                hydraulic_engineering       0.95      0.98      0.96     10000
                   instrument_science       0.48      0.50      0.49     10000
   journalism_and_media_communication       0.26      0.11      0.16     10000
               landscape_architecture       0.78      0.83      0.80     10000
                                  law       0.50      0.55      0.53     10000
                              library       0.53      0.51      0.52     10000
                           literature       0.52      0.53      0.52     10000
                    materials_science       0.49      0.50      0.50     10000
                          mathematics       0.87      0.90      0.88     10000
               mechanical_engineering       0.48      0.37      0.42     10000
                              medical       0.41      0.42      0.41     10000
                   mining_engineering       0.84      0.93      0.89     10000
                                movie       0.59      0.71      0.64     10000
                      music_and_dance       0.75      0.86      0.80     10000
                                 news       0.23      0.13      0.16     10000
                      nuclear_science       0.92      0.96      0.94     10000
                        ocean_science       0.83      0.92      0.88     10000
                  optical_engineering       0.70      0.78      0.74     10000
                             painting       0.91      0.96      0.94     10000
                                  pet       0.91      0.95      0.93     10000
petroleum_and_natural_gas_engineering       0.92      0.96      0.94     10000
                           philosophy       0.63      0.66      0.64     10000
                                photo       0.80      0.85      0.82     10000
                              physics       0.40      0.35      0.37     10000
                             politics       0.38      0.41      0.39     10000
                           psychology       0.62      0.66      0.64     10000
                public_administration       0.35      0.33      0.34     10000
                         relationship       0.84      0.88      0.86     10000
                            sociology       0.46      0.50      0.48     10000
                               sports       0.66      0.82      0.73     10000
                           statistics       0.60      0.70      0.65     10000
                      systems_science       0.53      0.53      0.53     10000
                      textile_science       0.81      0.86      0.83     10000
                           topicality       0.97      0.99      0.98     10000
           transportation_engineering       0.51      0.52      0.51     10000
                               travel       0.68      0.72      0.70     10000
                       urban_planning       0.56      0.62      0.59     10000
                      weapons_science       0.97      0.99      0.98     10000

                             accuracy                           0.64    670000
                            macro avg       0.62      0.64      0.63    670000
                         weighted avg       0.62      0.64      0.63    670000
```

## ⚠️Known Limitation

The classifier does not handle short text well, which might not be surprising.
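
One possible way to work around this during curation is to ignore predictions made on very short inputs or with low scores. The snippet below is only a sketch under stated assumptions: `gate_predictions`, `MIN_WORDS`, and `MIN_SCORE` are hypothetical names and values that do not ship with this model, the thresholds have not been tuned, and the example scores are made up. It expects prediction dicts shaped like the output of `predict()` in the Usage section.

```python
from typing import Dict, List, Optional

# Hypothetical thresholds (assumptions, not tuned): adjust them on your own data.
MIN_WORDS = 20
MIN_SCORE = 0.5


def gate_predictions(
    texts: List[str],
    preds: List[Dict],
    min_words: int = MIN_WORDS,
    min_score: float = MIN_SCORE,
) -> List[Optional[str]]:
    """Return each predicted label, or None when the prediction looks unreliable."""
    gated = []
    for text, pred in zip(texts, preds):
        too_short = len(text.split()) < min_words  # whitespace word count
        low_confidence = pred["score"] < min_score
        gated.append(None if too_short or low_confidence else pred["label"])
    return gated


# Illustrative inputs only: the scores below are invented for the demo.
long_text = " ".join(["entanglement"] * 40)  # stand-in for a 40-word document
texts = ["Arsenal is the best team in the world", long_text, long_text]
preds = [
    {"label": "sports", "score": 0.9},   # rejected: only 8 words
    {"label": "physics", "score": 0.9},  # kept: long enough and confident
    {"label": "physics", "score": 0.3},  # rejected: score below MIN_SCORE
]
print(gate_predictions(texts, preds))
# [None, 'physics', None]
```

Returning `None` instead of dropping the item keeps the output aligned with the input list, so a caller can decide later whether to discard such documents or route them to another classifier.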