
Model Card for FinerWeb Line Quality Classifier

This model is a DeBERTa-v3-base classifier trained to identify high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to improve training data quality for language models.

Model Details

Model Description

  • Developed by: University of Turku (Erik Henriksson*, Otto Tarkka*, and Filip Ginter; *equal contribution)
  • Model type: Line-level text quality classifier
  • Language(s) (NLP): English
  • License: apache-2.0
  • Finetuned from model: microsoft/deberta-v3-base

Model Sources

Uses

Direct Use

The model is designed to classify text lines as Clean (high-quality) or as belonging to one of several low-quality categories. It outputs a quality score between 0 and 1 for each input line, where scores closer to 1 indicate higher-quality content.
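As a minimal usage sketch, the code below scores lines with the Transformers library. It assumes the checkpoint loads with AutoModelForSequenceClassification, that the label mapping contains a class named "Clean", and that the quality score can be read off as the probability of that class; the example lines are invented.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_ID = "TurkuNLP/finerweb-quality-classifier"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

lines = [
    "The study examines line-level quality filtering for web text.",
    "Click here to subscribe to our newsletter!!!",
]

with torch.no_grad():
    inputs = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Assumption: the quality score is the probability assigned to the "Clean" class.
label2id = {label: idx for idx, label in model.config.id2label.items()}
clean_id = label2id.get("Clean", 0)

for line, p in zip(lines, probs):
    print(f"{p[clean_id].item():.3f}  {line}")
```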

Out-of-Scope Use

The model is specifically trained on English web text and may not perform well on other languages or specialized domains.

Training Details

Training Data

The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. The data preparation involved:

  1. Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels
  2. Label refinement and grouping into 9 broader categories using OpenAI's o1-preview model
  3. Manual verification of a small sample (50 documents, 726 lines) to assess agreement between the human annotators and the LLM-generated labels

The final dataset consisted of 86.24% Clean lines, with the remaining 13.76% distributed across 8 low-quality categories.
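The first step of this pipeline can be pictured roughly as in the sketch below, which uses the OpenAI Python SDK. The prompt wording, per-document batching, and output format are illustrative assumptions, not the authors' actual labeling setup.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative system prompt; the actual FinerWeb-10BT prompt is not reproduced here.
SYSTEM_PROMPT = (
    "You will receive the lines of a web document, one per row, prefixed with an index. "
    "For each line, answer '<index>: Clean' if it is good language-model training data, "
    "or '<index>: <short descriptive label>' naming the quality problem otherwise."
)

def label_document(lines):
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": numbered},
        ],
        temperature=0,
    )
    # One "<index>: <label>" entry per input line, to be parsed downstream.
    return response.choices[0].message.content.splitlines()
```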

Training Procedure

Training Hyperparameters

  • Training regime: bfloat16 precision
  • Learning rate: 1e-5
  • Batch size: 16
  • Early stopping: Applied with patience of 5 based on evaluation loss
  • Maximum epochs: 5
  • Label smoothing: 0.1 applied to cross-entropy loss
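Taken together, these settings roughly correspond to the Hugging Face Trainer configuration sketched below. This is not the authors' training script: the evaluation frequency, output paths, and the pre-tokenized train_dataset / eval_dataset objects are placeholder assumptions.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=9  # Clean + 8 low-quality categories
)

args = TrainingArguments(
    output_dir="finerweb-quality-classifier",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    bf16=True,
    label_smoothing_factor=0.1,      # label smoothing on the cross-entropy loss
    evaluation_strategy="steps",     # evaluation frequency is an assumption; not stated in the card
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # pre-tokenized line-level dataset (assumed to exist)
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```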

Evaluation

Testing Data, Factors & Metrics

Metrics

The model was evaluated using:

  • Micro F1 score: 0.81
  • Macro F1 score: 0.66
  • Clean class metrics:
    • Precision: 0.88
    • Recall: 0.91
    • F1: 0.90
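These are standard multi-class metrics; as a rough illustration, they could be computed as in the sketch below. The y_true / y_pred values and every category name other than Clean are invented placeholders, not the actual test-split predictions.

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Placeholder labels for illustration only.
y_true = ["Clean", "Clean", "Advertisement", "Clean", "Navigation"]
y_pred = ["Clean", "Advertisement", "Advertisement", "Clean", "Clean"]

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Per-class precision, recall, and F1 for the Clean label.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=["Clean"])

print(f"micro F1={micro_f1:.2f}  macro F1={macro_f1:.2f}")
print(f"Clean: precision={prec[0]:.2f}  recall={rec[0]:.2f}  F1={f1[0]:.2f}")
```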

Technical Specifications

Compute Infrastructure

Hardware

Computational resources for this study were provided by CSC — IT Center for Science. Training was performed on a single A100 GPU.
