FineData

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).

Recent Activity

joelniklaus updated a collection 10 days ago

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Organization Card

🍷 FineData

This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.

🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blog post and paper.
📚 FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
📄 FinePDFs: 3T tokens of text data extracted from PDFs sourced from the web. See the blog post.
🌐 FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
📄 FinePDFs-Edu: 350B+ highly educational tokens filtered from 📄 FinePDFs.
💬 FineTranslations: 1+1T tokens of parallel text translated from 500+ 🥂 FineWeb2 languages.

Collections 8

Spaces 7

FinePDFs: Liberating 3T of the finest tokens from PDFs

FineWiki Viewer

Viewer to explore the finewiki dataset

FineWeb: decanting the web for the finest text data at scale

Generate a curated web‑text dataset for LLM training

Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks

Evaluate multilingual models using FineTasks

Tasks Explorer

Explore and analyze experiment results

Models 105

HuggingFaceFW/finepdfs_edu_classifier_eng_Latn

0.4B • Updated Nov 11, 2025 • 4.84k • 2

HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn

0.4B • Updated Oct 6, 2025 • 21

HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn

0.4B • Updated Oct 6, 2025 • 13

HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn

0.4B • Updated Oct 6, 2025 • 7

HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr

0.3B • Updated Oct 6, 2025

HuggingFaceFW/finepdfs_edu_classifier_nno_Latn

0.3B • Updated Oct 6, 2025 • 7

HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl

0.3B • Updated Oct 6, 2025 • 3

HuggingFaceFW/finepdfs_edu_classifier_tam_Taml

0.3B • Updated Oct 6, 2025 • 1

HuggingFaceFW/finepdfs_edu_classifier_azj_Latn

0.3B • Updated Oct 6, 2025 • 2

HuggingFaceFW/finepdfs_edu_classifier_afr_Latn

0.3B • Updated Oct 6, 2025 • 3
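
The classifier repositories listed above end in a language and script suffix such as eng_Latn or guj_Gujr. As a minimal sketch, assuming every name follows the pattern `<task>_<language>_<script>` (the helper `parse_classifier_name` and the `ClassifierName` type are hypothetical and not part of any Hugging Face library), the suffix can be split off like this:

```python
from typing import NamedTuple


class ClassifierName(NamedTuple):
    """Hypothetical container for the parts of a classifier repo name."""
    task: str
    language: str
    script: str


def parse_classifier_name(repo_id: str) -> ClassifierName:
    # Drop the organization prefix ("HuggingFaceFW/") if present.
    name = repo_id.split("/", 1)[-1]
    # Assumption: the last two underscore-separated fields are the
    # language code and the script code; everything before is the task.
    task, language, script = name.rsplit("_", 2)
    return ClassifierName(task, language, script)


parsed = parse_classifier_name("HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr")
print(parsed.task, parsed.language, parsed.script)
```

Splitting from the right keeps task names that themselves contain underscores (for example a `_v2` variant) intact.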

Datasets 34

HuggingFaceFW/fineweb_100BT-shuffled

Viewer • Updated 10 days ago • 161M • 18

HuggingFaceFW/fineweb_edu_100BT-shuffled

Viewer • Updated 10 days ago • 102M • 195

HuggingFaceFW/finepdfs_100BT-shuffled

Viewer • Updated 11 days ago • 14.6M • 91

HuggingFaceFW/dclm_100BT-shuffled

Viewer • Updated 11 days ago • 89.3M • 68

HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled

Viewer • Updated 11 days ago • 62.1M • 49 • 2

HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT-shuffled

Viewer • Updated 11 days ago • 56.1M • 124

HuggingFaceFW/finepdfs_edu_100BT-shuffled

Viewer • Updated 11 days ago • 17.8M • 247

HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT

Viewer • Updated 12 days ago • 62.1M • 6.09k

HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT

Viewer • Updated 12 days ago • 56.1M • 22.9k

HuggingFaceFW/finepdfs_100BT

Viewer • Updated 13 days ago • 29.9M • 4.67k
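
Several dataset names above read like mixture recipes, for example finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled. The sketch below decodes such a name under two assumptions that this page does not confirm: that `<source>_<N>BT` means roughly N billion tokens drawn from that source, and that a trailing `-shuffled` marks a shuffled copy. The helper `parse_mixture` is hypothetical.

```python
import re

# Assumption: each component looks like "<source>_<N>BT",
# read here as N billion tokens taken from <source>.
COMPONENT = re.compile(r"^(?P<source>[a-z0-9_]+)_(?P<billions>\d+)BT$")


def parse_mixture(repo_id: str) -> dict:
    # Drop the organization prefix ("HuggingFaceFW/") if present.
    name = repo_id.split("/", 1)[-1]
    # Assumption: "-shuffled" is a suffix flag, not a component.
    shuffled = name.endswith("-shuffled")
    if shuffled:
        name = name[: -len("-shuffled")]
    components = {}
    for part in name.split("-"):
        match = COMPONENT.match(part)
        if match is None:
            raise ValueError(f"unrecognized component: {part!r}")
        components[match["source"]] = int(match["billions"])
    return {"components": components, "shuffled": shuffled}


mix = parse_mixture(
    "HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled"
)
print(mix["components"], "shuffled:", mix["shuffled"])
```

Under these assumptions the three components of the example name add up to 100, which matches the 100BT size used by the single-source datasets in the same list.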
