# Model Card for FinerWeb Line Quality Classifier
This model is a DeBERTa-v3-base classifier trained to identify high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to improve training data quality for language models.
## Model Details

### Model Description
- Developed by: University of Turku (Erik Henriksson\*, Otto Tarkka\*, Filip Ginter; \*equal contribution)
- Model type: Line-level text quality classifier
- Language(s) (NLP): English
- License: apache-2.0
- Finetuned from model: microsoft/deberta-v3-base
### Model Sources
- Paper: https://arxiv.org/abs/2501.07314
- Repository: https://github.com/TurkuNLP/finerweb-10bt
## Uses

### Direct Use
The model classifies text lines as either Clean (high-quality) or as one of eight low-quality categories. For each input line it outputs a quality score between 0 and 1, where scores closer to 1 indicate higher-quality content.
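A minimal usage sketch with the `transformers` library, where the quality score is read as the probability of the Clean class. This assumes the checkpoint ships an `id2label` mapping containing `"Clean"`; check `model.config.id2label` for the exact label names:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "TurkuNLP/finerweb-quality-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

lines = [
    "The study found a significant correlation between the two variables.",
    "Click here to subscribe to our newsletter!!!",
]

inputs = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Assumption: the Clean class index is read from the model config.
clean_id = {label: idx for idx, label in model.config.id2label.items()}["Clean"]
for line, score in zip(lines, probs[:, clean_id].tolist()):
    print(f"{score:.3f}  {line}")
```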
### Out-of-Scope Use
The model is specifically trained on English web text and may not perform well on other languages or specialized domains.
## Training Details

### Training Data
The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. The data preparation involved:
- Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels
- Label refinement and grouping into 9 broader categories using OpenAI's o1-preview model
- Manual verification of a small sample (50 documents, 726 lines) to assess inter-annotator agreement between the human annotators and the LLM-generated labels

The final dataset consisted of 86.24% Clean lines, with the remaining 13.76% distributed across the 8 low-quality categories.
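To illustrate the first step, a line-labeling request to GPT-4o mini might look roughly like the following. This is a hypothetical sketch; the exact prompt and output format used in the project are documented in the paper and repository:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instructions; the project's actual prompt differs.
SYSTEM_PROMPT = (
    "You will be given the lines of a web document, one per row. "
    "For each line, answer 'Clean' if it is high-quality running text; "
    "otherwise give a short descriptive label for the quality problem."
)

def label_lines(document: str) -> list[str]:
    """Return one descriptive label per line of the input document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": document},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.splitlines()
```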
### Training Procedure

#### Training Hyperparameters
- Training regime: bfloat16 precision
- Learning rate: 1e-5
- Batch size: 16
- Early stopping: Applied with patience of 5 based on evaluation loss
- Maximum epochs: 5
- Label smoothing: 0.1 applied to cross-entropy loss
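A minimal sketch of how these settings map onto the Hugging Face `Trainer` API in recent `transformers` versions (assuming a standard `Trainer`-based fine-tuning setup; dataset loading and tokenization are omitted):

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=9,  # Clean + 8 low-quality categories
)

args = TrainingArguments(
    output_dir="finerweb-quality-classifier",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    bf16=True,                   # bfloat16 precision
    label_smoothing_factor=0.1,  # label smoothing on the cross-entropy loss
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: tokenized line/label pairs
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```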
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics
The model was evaluated with the following results:
- Micro F1 score: 0.81
- Macro F1 score: 0.66
- Clean class metrics:
  - Precision: 0.88
  - Recall: 0.91
  - F1: 0.90
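These are standard classification metrics; given per-line gold and predicted labels, they can be reproduced with scikit-learn (a generic sketch, not the project's evaluation script):

```python
from sklearn.metrics import classification_report, f1_score

# y_true and y_pred are per-line label lists
# (each entry is "Clean" or one of the low-quality category names).
def evaluate(y_true: list[str], y_pred: list[str]) -> None:
    print("Micro F1:", round(f1_score(y_true, y_pred, average="micro"), 2))
    print("Macro F1:", round(f1_score(y_true, y_pred, average="macro"), 2))
    # Per-class precision, recall, and F1, including the Clean class
    print(classification_report(y_true, y_pred, digits=2))
```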
## Technical Specifications

### Compute Infrastructure

#### Hardware
Computational resources for this study were provided by CSC — IT Center for Science. Training was performed on a single A100 GPU.