FinePDFs-DCLM classifier (English)

Model summary

This is a classifier for judging the instructional/Q&A value of web pages. It was developed to filter and curate instructional and question-answering content from web datasets, and was trained on 1304547 annotations generated by Qwen3-235B-A22B-Instruct-2507 for web samples from the FinePDFs dataset.

How to use in transformers

To load the FinePDFs-DCLM classifier, use the following code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import re
CHUNK_SIZE = 2048 - 2  # model context length minus room for special tokens
MAX_CHARS = 10_000

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_dclm_classifier_English")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_dclm_classifier_English")
regex_whitespace = re.compile(r'\s')

def create_text_chunks(text: str, tokenizer):
    def trim_to_whitespace(text: str, trim_start: bool = True, trim_end: bool = True):
        if trim_start:
            match = regex_whitespace.search(text)
            if match:
                text = text[match.start()+1:]
            else:
                text = text[10:]
        if trim_end:
            match = regex_whitespace.search(text[::-1])
            if match:
                text = text[:len(text) - match.start() - 1]
            else:
                text = text[:-10]
        return text

    # First tokenize the text.
    # Speed hack: we take at most MAX_CHARS characters from each end.
    if len(text) <= 2*MAX_CHARS:
        tokens = tokenizer.encode(text[:MAX_CHARS], return_tensors="np", add_special_tokens=False)[0]
        # Process the top chunks
        chunks_from_top_sampled = [tokens[:CHUNK_SIZE]]

        chunks_top_text = tokenizer.batch_decode(chunks_from_top_sampled, skip_special_tokens=True)

        chunks_top_text = [trim_to_whitespace(chunks_top_text[0], trim_start=False, trim_end=True)]
        return chunks_top_text

    else:
        # We tokenize the top and bottom of text
        text_top = text[:MAX_CHARS]
        text_bottom = text[-MAX_CHARS:]

        tokens = tokenizer.batch_encode_plus([text_top, text_bottom], return_tensors="np", add_special_tokens=False)["input_ids"]

        # This ensures that the second chunk is always maxed out
        chunks = [tokens[0][:CHUNK_SIZE], tokens[1][-CHUNK_SIZE:]]

        chunks_text = tokenizer.batch_decode(chunks, skip_special_tokens=True)
        chunks_top_text = [trim_to_whitespace(chunks_text[0], trim_start=False, trim_end=True)]
        chunks_bottom_text = [trim_to_whitespace(chunks_text[1], trim_start=True, trim_end=False)]
        return chunks_top_text + chunks_bottom_text

text = "This is a test sentence." * 2000
chunks = create_text_chunks(text, tokenizer)
scores = []
for chunk in chunks:
    inputs = tokenizer(chunk, return_tensors="pt", padding="longest", truncation=True)
    outputs = model(**inputs)
    logits = outputs.logits.squeeze(-1).float().detach().numpy()
    score = logits.item()
    scores.append(score)

print(max(scores))
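For data curation, the raw regression output can be clamped to the 0-5 annotation scale and thresholded. A minimal sketch, using the 3.5 threshold recommended in the Limitations section below (`keep_document` is a hypothetical helper, not part of the released code):

```python
def keep_document(raw_score: float, threshold: float = 3.5) -> bool:
    """Clamp the regression output to the 0-5 scale and threshold it."""
    clamped = min(max(raw_score, 0.0), 5.0)
    return clamped >= threshold

print(keep_document(4.2))  # True: clearly instructional
print(keep_document(1.0))  # False: below the curation threshold
```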

Training

The classifier was trained on 7740960 pairs of web samples and their scores from 0 to 5, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their instruction/Q&A quality, with 0 being not instructional and 5 being highly instructional.

Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:

Below is an extract from a PDF file. Evaluate whether the extract exhibits properties suitable for instruction-following or question-answering training data using the 6-point scoring system described below. Select the single score that best represents the extract's quality level:

**Score 0: Spam, Garbled, or Completely Unusable Content**
- Award 0 points for SEO spam content, promotional material with no educational value, completely garbled/corrupted text that is unreadable, random character sequences, or severely corrupted formatting that makes the content incomprehensible.

**Score 1: Simple Lists, Forms, or Minimal-Value Content**
- Award 1 point for content that has basic readable formatting but consists primarily of simple lists without context, forms, contact information, schedules, basic data tables without explanation, or other minimal-value structured content that lacks meaningful narrative or educational substance.

**Score 2: Cohesive Text Without Educational Value**
- Award 2 points if the extract contains cohesive, well-structured text that flows logically but lacks educational or instructional value. This includes meeting reports, business correspondence, letters, basic manual descriptions, administrative documents, or narrative content that doesn't teach or explain concepts.

**Score 3: Educational Content Without Q&A Structure**
- Award 3 points if the extract contains educational or informational content that could be useful for learning but doesn't follow a clear instructional format. This includes Wikipedia-style articles, research papers, academic content, encyclopedic entries, or explanatory text that presents information without explicit teaching structure.

**Score 4: Instructional Manuals and Structured Q&A**
- Award 4 points if the extract demonstrates clear instructional format with identifiable structure such as how-to guides, instruction manuals, structured question-answer pairs, problem-solution formats, or other organized pedagogical patterns. The content should be well-organized and follow recognizable educational conventions.

**Score 5: High-Quality Instructional Content with Explanations**
- Award 5 points if the extract exhibits exemplary instruction-response or question-answer properties with clear reasoning and detailed explanations. It should demonstrate thoughtful, step-by-step reasoning found in high-quality educational content like comprehensive tutorials, detailed explanations with context and reasoning, or expert-level instructional material that provides not just answers but explanatory reasoning and educational depth.

## Evaluation Process

The extract: {example}

After examining the extract:
- Briefly justify your total score, focusing on the content type and instructional/explanatory qualities, up to 100 words.
- Conclude with the score using the format: "Instruction/Q&A score: <total points>"
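The annotator's reply can then be reduced to a single label. The following is a hypothetical sketch (not part of the released code) that assumes the completion ends with the exact format requested in the prompt above:

```python
import re
from typing import Optional

# Matches the final line requested by the prompt, e.g. "Instruction/Q&A score: 4".
SCORE_RE = re.compile(r"Instruction/Q&A score:\s*(\d)")

def parse_score(completion: str) -> Optional[int]:
    """Return the 0-5 score from an annotator completion, or None if absent."""
    match = SCORE_RE.search(completion)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 5 else None
```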

We added a classification head with a single regression output to answerdotai/ModernBERT-large, unfroze the last 4 layers, and trained the model for 5000 steps with a learning rate of 3e-4.

Training Details:

  • Model: answerdotai/ModernBERT-large with a classification head
  • Dataset: 7740960 samples from Qwen3-235B-A22B-Instruct-2507 annotations
  • Steps: 5000
  • Learning Rate: 3e-4
  • Class distribution: {0: 1290160, 1: 1290160, 2: 1290160, 3: 1290160, 4: 1290160, 5: 1290160}
  • Evaluation Metric: F1 score
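The partial-freezing setup above can be sketched as follows. `freeze_all_but_last_n` is a hypothetical helper shown on a toy stand-in encoder; with the real model it would be applied to the encoder layers of answerdotai/ModernBERT-large (loaded with a single regression output). The exact layer-selection logic used by the authors is an assumption.

```python
import torch.nn as nn

def freeze_all_but_last_n(layers: nn.ModuleList, n: int) -> None:
    """Freeze every encoder layer except the last `n`."""
    for layer in layers[:-n] if n > 0 else layers:
        for param in layer.parameters():
            param.requires_grad = False

# Toy stand-in encoder with 8 layers:
encoder = nn.ModuleList(nn.Linear(4, 4) for _ in range(8))
freeze_all_but_last_n(encoder, 4)
trainable = [any(p.requires_grad for p in l.parameters()) for l in encoder]
# trainable → [False, False, False, False, True, True, True, True]
```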

Classification report

We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 20000 Qwen3-235B-A22B-Instruct-2507-annotated samples.

Validation Report:
|   class |   precision |   recall |   f1-score |   support |
|--------:|------------:|---------:|-----------:|----------:|
|       0 |        0.81 |     0.91 |       0.86 |      2036 |
|       1 |        0.79 |     0.75 |       0.77 |      4526 |
|       2 |        0.81 |     0.77 |       0.79 |      7277 |
|       3 |        0.75 |     0.67 |       0.71 |      4720 |
|       4 |        0.37 |     0.72 |       0.49 |       943 |
|       5 |        0.47 |     0.47 |       0.47 |       498 |
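A plausible sketch of the discretization used for the report above: round each regression output to the nearest integer and clip it to the 0-5 range (the exact mapping used by the authors is an assumption).

```python
import numpy as np

def to_classes(raw_scores: np.ndarray) -> np.ndarray:
    """Map raw regression outputs to discrete 0-5 class labels."""
    return np.clip(np.round(raw_scores), 0, 5).astype(int)

preds = to_classes(np.array([-0.3, 1.6, 2.4, 4.9, 5.7]))
# preds → [0, 2, 2, 5, 5]
```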

Confusion matrix

We verify that the predicted DCLM scores are indeed close to their ground truth, and are mostly impacted by the noisy annotation.

Confusion Matrix:
|   class  |    0 |    1 |    2 |    3 |   4 |   5 |
|---------:|-----:|-----:|-----:|-----:|----:|----:|
|        0 | 1856 |  148 |   26 |    4 |   1 |   1 |
|        1 |  408 | 3406 |  631 |   67 |  13 |   1 |
|        2 |   29 |  689 | 5589 |  820 | 133 |  17 |
|        3 |    2 |   57 |  597 | 3154 | 787 | 123 |
|        4 |    0 |    8 |   24 |  104 | 682 | 125 |
|        5 |    0 |    1 |    9 |   45 | 207 | 236 |
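The confusion matrix can be cross-checked against the validation report: per-class recall is the diagonal entry divided by the row sum, which reproduces the recall column above.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (from the table above).
cm = np.array([
    [1856,  148,   26,    4,   1,   1],
    [ 408, 3406,  631,   67,  13,   1],
    [  29,  689, 5589,  820, 133,  17],
    [   2,   57,  597, 3154, 787, 123],
    [   0,    8,   24,  104, 682, 125],
    [   0,    1,    9,   45, 207, 236],
])
recall = cm.diagonal() / cm.sum(axis=1)
# recall.round(2) → [0.91, 0.75, 0.77, 0.67, 0.72, 0.47]
```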

Limitations

While the FinePDFs-DCLM classifier performs well in distinguishing high-quality instructional content for FinePDFs dataset, there are some limitations:

  • Scope: The model's performance might change on other datasets, in particular for out-of-distribution samples. It is also focused on instructional/Q&A content and may not perform as well on specialized domains.
  • Bias: The model's performance depends on the quality and representativeness of the training data and of the LLM used for annotation. Biases in both can affect the classifier's judgments. It might overfit to instructional/Q&A-looking content at the higher scores; we recommend using int_score >= 3.5 as a threshold for data curation.
  • Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.

The training and inference code is available on GitHub: https://github.com/huggingface/finepdfs/tree/main/classification
