---
license: cc-by-4.0
datasets:
  - kenhktsui/math-classifiers-data
language:
  - en
metrics:
  - accuracy
  - recall
  - precision
base_model:
  - facebook/fasttext-en-vectors
pipeline_tag: text-classification
library_name: fasttext
---
# Model Card for FastText Math vs. Non-Math Classifier

A FastText-based binary classifier trained to distinguish “math” text from “non-math” text in English webpages. It is fine-tuned on the `kenhktsui/math-classifiers-data` dataset using `facebook/fasttext-en-vectors` as the base word-embedding model.

---

## Model Details

### Overview

This model takes raw English text (for example, the plain-text extraction of an HTML page) and predicts whether the content is math-related (label `__label__math`) or not (label `__label__non-math`). It was developed by user **herooooooooo** and is released under the CC-BY-4.0 license.

- **Model type:** Supervised FastText classifier (binary classification)
- **Developed by:** herooooooooo
- **License:** CC-BY-4.0
- **Language:** English (en)
- **Base model:** `facebook/fasttext-en-vectors` (pretrained word vectors)
- **Fine-tuned on:** `kenhktsui/math-classifiers-data` (a public Hugging Face dataset of labeled math vs. non-math examples)

### Intended Use

- **Primary application:** Filtering or labeling large corpora of webpages or documents for math content (e.g., selecting only math-related pages from web crawls).
- **Foreseeable users:** Researchers preparing math-focused corpora, data engineers curating domain-specific text, or educators building math content pipelines.
- **Out-of-scope:**  
  - Not intended for general topic classification beyond “math vs. non-math.”  
  - Performance may degrade on very short texts (fewer than ~20 tokens) or on highly technical subdomains that are poorly represented in the training set (e.g., specialized LaTeX macros not covered by the dataset).
  - Should not be used for any safety- or compliance-critical pipeline without additional validation.
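
The primary application above (keeping only math-related pages from a crawl) can be sketched as follows. This is a minimal illustration, not the author's pipeline: `filter_math`, the `predict` callable, and the `threshold` value are assumptions made for the example. The callable is assumed to return `(labels, probabilities)`, which is the shape that `model.predict(text)` returns in the `fasttext` Python package.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Assumed prediction shape: (labels, probabilities), best label first.
Prediction = Tuple[Sequence[str], Sequence[float]]

MATH_LABEL = "__label__math"  # label name taken from this model card


def filter_math(
    docs: Iterable[str],
    predict: Callable[[str], Prediction],
    threshold: float = 0.5,  # hypothetical cutoff; tune it on your own data
) -> Iterator[str]:
    """Yield the documents predicted as math with at least `threshold` confidence."""
    for doc in docs:
        # fastText classifies one line at a time, so collapse newlines and
        # repeated whitespace into single spaces before predicting.
        line = " ".join(doc.split())
        if not line:
            continue  # skip empty documents
        labels, probs = predict(line)
        if labels[0] == MATH_LABEL and probs[0] >= threshold:
            yield doc  # return the original, unmodified document
```

With a loaded model, `predict` would typically be the model's own `predict` method; passing it in as a parameter keeps the filter easy to test with a stub.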

---

## Bias, Risks, and Limitations

- **Biases:**  
  - The model is trained on the `kenhktsui/math-classifiers-data` dataset, which predominantly contains English posts from math forums and random English web text. It may underperform on varieties of English from outside North America and Europe (e.g., Indian English math blogs) if they were underrepresented.  
  - The classifier can mislabel “math-adjacent” text (e.g., computer science blogs discussing algorithms, physics pages dense with formulas) as “non-math” if the training set did not include similar examples.

- **Technical limitations:**  
  - Because FastText represents text as a bag of words plus n-grams, it does not capture long-range dependencies or broader context. Sparse math content (e.g., a single embedded formula in an otherwise non-math article) may be missed.  
  - Very short snippets (e.g., a single equation or a title) may be misclassified because there may not be enough context to distinguish “math” from “non-math.”
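
One way to act on the short-snippet limitation is to set aside texts below a minimum length before classifying them. The sketch below assumes whitespace-separated tokens and reuses the approximate 20-token figure mentioned under Intended Use; both the tokenization and the helper name `long_enough` are illustrative assumptions, not part of the model.

```python
MIN_TOKENS = 20  # approximate cutoff from this card; adjust for your data


def long_enough(text: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True when `text` has at least `min_tokens` whitespace-separated tokens."""
    # Assumption: whitespace splitting is a rough stand-in for the tokenizer
    # used when the ~20-token guideline was written.
    return len(text.split()) >= min_tokens
```

Texts that fail the check can be routed to manual review or to a fallback rule instead of being labeled with low-confidence predictions.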

### Recommendations

- Before applying at scale, evaluate on a held-out set of your target webpages (especially if they come from a domain not represented in the original dataset).  
- If you encounter persistent misclassification on a new subdomain (e.g., a specialized math blog), collect additional labeled examples from that source and fine-tune or retrain a new FastText model.  
- Use appropriate preprocessing (HTML-to-text extraction, removal of boilerplate navigation) to feed only the main article content into the model for best results.
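
The held-out evaluation recommended above can be scored with the three metrics listed in this card's metadata (accuracy, precision, recall). The following is a minimal sketch that treats `__label__math` as the positive class; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
from typing import Dict, Sequence


def binary_metrics(
    gold: Sequence[str],
    pred: Sequence[str],
    positive: str = "__label__math",  # label name taken from this model card
) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    correct = sum(g == p for g, p in pairs)
    return {
        # Assumption: report 0.0 instead of raising when a denominator is zero.
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Here `gold` holds the hand-assigned labels for the held-out pages and `pred` holds the top label the classifier returned for each page, in the same order.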

---

## How to Get Started with the Model

Install dependencies:

```bash
pip install fasttext tiktoken