- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
  Paper • 2506.20920 • Published • 57
- HuggingFaceFW/fineweb-2
  Viewer • Updated • 5.02B • 38.3k • 567
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks
  Evaluate multilingual models using FineTasks • 67
FineData
Enterprise
community
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
FineWeb-Edu datasets, classifier and ablation model
- HuggingFaceFW/fineweb-edu
  Viewer • Updated • 3.3B • 119k • 705
- HuggingFaceFW/fineweb-edu-score-2
  Viewer • Updated • 13.1B • 7.54k • 77
- HuggingFaceFW/fineweb-edu-classifier
  Text Classification • 0.1B • Updated • 5.93k • 186
- HuggingFaceFW/ablation-model-fineweb-edu
  Text Generation • 2B • Updated • 1.97k • 15
Ablation models trained for our data experiments.
- HuggingFaceFW/ablation-exp-textext-warc_trafilatura-28BT
  Text Generation • 2B • Updated • 13
- HuggingFaceFW/ablation-exp-textext-wet-28BT
  Text Generation • 2B • Updated • 8
- HuggingFaceFW/ablation-exp-fw-base_filtering-350BT
  Text Generation • 2B • Updated • 36
- HuggingFaceFW/ablation-exp-dedup-global_minhash-350BT
  Text Generation • 2B • Updated • 12
- FineWeb: decanting the web for the finest text data at scale
  Generate high-quality web text data for LLM training • 988
- HuggingFaceFW/fineweb
  Viewer • Updated • 25B • 210k • 2.23k
- HuggingFaceFW/fineweb-edu
  Viewer • Updated • 3.3B • 119k • 705
- HuggingFaceFW/fineweb-edu-score-2
  Viewer • Updated • 13.1B • 7.54k • 77
1.8B models trained on 350BT to compare different pretraining datasets
- HuggingFaceFW/ablation-model-fineweb-edu
  Text Generation • 2B • Updated • 1.97k • 15
- HuggingFaceFW/ablation-model-fineweb-v1
  Text Generation • 2B • Updated • 1.77k • 14
- HuggingFaceFW/ablation-model-refinedweb
  Text Generation • 2B • Updated • 32 • 3
- HuggingFaceFW/ablation-model-c4
  Text Generation • 2B • Updated • 31 • 4