# Model Card for FinerWeb Line Quality Classifier
This model is a DeBERTa-v3-base classifier trained to identify high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to improve training data quality for language models.
## Model Details

### Model Description
- Developed by: University of Turku (Erik Henriksson\*, Otto Tarkka\*, Filip Ginter; \*equal contribution)
- Model type: Line-level text quality classifier
- Language(s) (NLP): English
- License: apache-2.0
- Finetuned from model: microsoft/deberta-v3-base
### Model Sources
- Paper: https://arxiv.org/abs/2501.07314
- Repository: https://github.com/TurkuNLP/finerweb-10bt
## Uses

### Direct Use
The model classifies text lines as either Clean (high-quality) or as one of eight low-quality categories. For each input line it outputs a quality score between 0 and 1, where scores closer to 1 indicate higher-quality content.
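A minimal usage sketch with the `transformers` library, where the quality score is read as the probability of the Clean class. This assumes the checkpoint ships an `id2label` mapping containing `"Clean"`; check `model.config.id2label` for the exact label names:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "TurkuNLP/finerweb-quality-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

lines = [
    "The study found a significant correlation between the two variables.",
    "Click here to subscribe to our newsletter!!!",
]

inputs = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Assumption: the Clean class index is read from the model config.
clean_id = {label: idx for idx, label in model.config.id2label.items()}["Clean"]
for line, score in zip(lines, probs[:, clean_id].tolist()):
    print(f"{score:.3f}  {line}")
```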
### Out-of-Scope Use
The model is specifically trained on English web text and may not perform well on other languages or specialized domains.
## Training Details

### Training Data
The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. The data preparation involved:
- Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels
- Label refinement and grouping into 9 broader categories using OpenAI's o1-preview model
- Manual verification of a small sample (50 documents, 726 lines) to assess inter-annotator agreement between the human annotators and the LLM-generated labels

The final dataset consisted of 86.24% Clean lines, with the remaining 13.76% distributed across the 8 low-quality categories.
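To illustrate the first step, a line-labeling request to GPT-4o mini might look roughly like the following. This is a hypothetical sketch; the exact prompt and output format used in the project are documented in the paper and repository:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instructions; the project's actual prompt differs.
SYSTEM_PROMPT = (
    "You will be given the lines of a web document, one per row. "
    "For each line, answer 'Clean' if it is high-quality running text; "
    "otherwise give a short descriptive label for the quality problem."
)

def label_lines(document: str) -> list[str]:
    """Return one descriptive label per line of the input document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": document},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.splitlines()
```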
### Training Procedure

#### Training Hyperparameters
- Training regime: bfloat16 precision
- Learning rate: 1e-5
- Batch size: 16
- Early stopping: Applied with patience of 5 based on evaluation loss
- Maximum epochs: 5
- Label smoothing: 0.1 applied to cross-entropy loss
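A minimal sketch of how these settings map onto the Hugging Face `Trainer` API in recent `transformers` versions (assuming a standard `Trainer`-based fine-tuning setup; dataset loading and tokenization are omitted):

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=9,  # Clean + 8 low-quality categories
)

args = TrainingArguments(
    output_dir="finerweb-quality-classifier",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    bf16=True,                   # bfloat16 precision
    label_smoothing_factor=0.1,  # label smoothing on the cross-entropy loss
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: tokenized line/label pairs
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```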
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics
The model was evaluated with the following results:
- Micro F1 score: 0.81
- Macro F1 score: 0.66
- Clean class metrics:
  - Precision: 0.88
  - Recall: 0.91
  - F1: 0.90
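These are standard classification metrics; given per-line gold and predicted labels, they can be reproduced with scikit-learn (a generic sketch, not the project's evaluation script):

```python
from sklearn.metrics import classification_report, f1_score

# y_true and y_pred are per-line label lists
# (each entry is "Clean" or one of the low-quality category names).
def evaluate(y_true: list[str], y_pred: list[str]) -> None:
    print("Micro F1:", round(f1_score(y_true, y_pred, average="micro"), 2))
    print("Macro F1:", round(f1_score(y_true, y_pred, average="macro"), 2))
    # Per-class precision, recall, and F1, including the Clean class
    print(classification_report(y_true, y_pred, digits=2))
```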
## Technical Specifications

### Compute Infrastructure

#### Hardware
Computational resources for this study were provided by CSC — IT Center for Science. Training was performed on a single A100 GPU.