Tried and tested mixes for strong pretraining
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
FineData
This is the home of the FineData team, a branch of the Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- FineWeb: a 15T-token English dataset for LLM pre-training. See the blogpost and paper.
- FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- FinePDFs: 3T tokens of text data extracted from PDFs sourced from the Web. See the blogpost.
- FineWiki: an updated, better extracted version of Wikipedia in 300+ languages.
- FinePDFs-Edu: 350B+ highly educational tokens filtered from FinePDFs.
- FineTranslations: 1+1T tokens of parallel text translated from 500+ FineWeb2 languages.
spaces 7
- FinePDFs: Liberating 3T of the finest tokens from PDFs • Running • Featured • 66
- FineWiki Viewer • Running • 11
  Viewer to explore the finewiki dataset
- FineWeb: decanting the web for the finest text data at scale • Running • Featured • 1.3k
  Generate a curated web-text dataset for LLM training
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks • Running • 88
  Evaluate multilingual models using FineTasks
- Tasks Explorer • Build error • 1
  Explore and analyze experiment results
models 105
- HuggingFaceFW/finepdfs_edu_classifier_eng_Latn • 0.4B • 4.84k • 2
- HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn • 0.4B • 21
- HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn • 0.4B • 13
- HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn • 0.4B • 7
- HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr • 0.3B
- HuggingFaceFW/finepdfs_edu_classifier_nno_Latn • 0.3B • 7
- HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl • 0.3B • 3
- HuggingFaceFW/finepdfs_edu_classifier_tam_Taml • 0.3B • 1
- HuggingFaceFW/finepdfs_edu_classifier_azj_Latn • 0.3B • 2
- HuggingFaceFW/finepdfs_edu_classifier_afr_Latn • 0.3B • 3
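The classifier repositories listed above appear to share a naming pattern: a task prefix, the word `classifier`, an optional version tag, and a suffix made of a three-letter language code and a four-letter script code (for example `eng_Latn` or `kaz_Cyrl`). The sketch below splits a repository id along that pattern so the models can be grouped by language. The pattern is inferred from the names on this page, not from any documented convention, and `parse_classifier_id` is a hypothetical helper.

```python
import re

# Assumed pattern, inferred only from the names listed on this page:
#   <task>_classifier[_v<N>]_<lang>_<Script>
CLASSIFIER_NAME = re.compile(
    r"^(?P<task>.+)_classifier"
    r"(?:_(?P<version>v\d+))?"
    r"_(?P<lang>[a-z]{3})_(?P<script>[A-Z][a-z]{3})$"
)


def parse_classifier_id(repo_id: str) -> dict:
    """Split a repo id such as 'HuggingFaceFW/finepdfs_edu_classifier_eng_Latn'
    into task, version (None when absent), language and script."""
    name = repo_id.split("/")[-1]  # drop the organization prefix
    match = CLASSIFIER_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized classifier name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    for repo in (
        "HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn",
        "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
        "HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl",
    ):
        print(parse_classifier_id(repo))
```

Names that do not fit the assumed pattern raise `ValueError` instead of being guessed at.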
datasets 34
- HuggingFaceFW/fineweb_100BT-shuffled • 161M • 18
- HuggingFaceFW/fineweb_edu_100BT-shuffled • 102M • 195
- HuggingFaceFW/finepdfs_100BT-shuffled • 14.6M • 91
- HuggingFaceFW/dclm_100BT-shuffled • 89.3M • 68
- HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled • 62.1M • 49 • 2
- HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT-shuffled • 56.1M • 124
- HuggingFaceFW/finepdfs_edu_100BT-shuffled • 17.8M • 247
- HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT • 62.1M • 6.09k
- HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT • 56.1M • 22.9k
- HuggingFaceFW/finepdfs_100BT • 29.9M • 4.67k
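The mix repositories listed above encode their composition in the name, for example `finepdfs_50BT-dclm_30BT-fineweb_edu_20BT`. As a minimal sketch, assuming each `<source>_<N>BT` segment means N billion tokens drawn from that source and that `-shuffled` is only a suffix, the name can be turned into per-source sizes and proportions. That reading is an assumption based on the names alone, and `parse_mix` and `mix_fractions` are hypothetical helpers.

```python
import re

# Assumption: a segment like "fineweb_edu_20BT" means 20 billion tokens
# taken from the "fineweb_edu" source.
SEGMENT = re.compile(r"^(?P<source>[a-z_]+)_(?P<size>\d+)BT$")


def parse_mix(repo_id: str) -> dict[str, int]:
    """Return {source: size in billions of tokens} for a mix repo id."""
    name = repo_id.split("/")[-1]           # drop the organization prefix
    name = name.removesuffix("-shuffled")   # optional suffix (Python 3.9+)
    sizes: dict[str, int] = {}
    for segment in name.split("-"):
        match = SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unrecognized segment: {segment!r}")
        sizes[match["source"]] = int(match["size"])
    return sizes


def mix_fractions(sizes: dict[str, int]) -> dict[str, float]:
    """Convert absolute sizes into each source's share of the mix."""
    total = sum(sizes.values())
    return {source: size / total for source, size in sizes.items()}


if __name__ == "__main__":
    sizes = parse_mix(
        "HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled"
    )
    print(sizes)  # {'finepdfs': 50, 'dclm': 30, 'fineweb_edu': 20}
    print(mix_fractions(sizes))
```

Under that assumption, the two mixed repositories each total 100BT, matching the single-source `*_100BT` entries.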