AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
FineData
This is the home of the FineData team, a branch of the Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- FineWeb: a 15T-token English dataset for LLM pre-training. See the blog post and paper.
- FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- FinePDFs: 3T tokens of text data extracted from PDFs sourced from the web.
- FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
Spaces (6)
- FineWiki Viewer • Viewer to explore the finewiki dataset • 3
- FineWeb: decanting the web for the finest text data at scale • Generate high-quality text data for LLMs using FineWeb • 1.11k
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks • Evaluate multilingual models using FineTasks • 69
- Tasks Explorer • Explore and analyze experiment results
- Datasets Metrics Explorer • Launch an interactive demo interface • 4
Models (30)
- HuggingFaceFW/fineweb-edu-classifier • Text Classification • 0.1B • 1.46k • 197
- HuggingFaceFW/Datasets-Metrics-Viewer-Data
- HuggingFaceFW/ablation-model-fineweb-edu • Text Generation • 2B • 414 • 16
- HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT • Text Generation • 2B • 1 • 1
- HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT • Text Generation • 2B • 2
- HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT • Text Generation • 2B • 1
- HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT • Text Generation • 2B • 12 • 3
- HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT • Text Generation • 2B • 5 • 2
- HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT • Text Generation • 2B • 3 • 4
- HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT • Text Generation • 2B • 2 • 2
Datasets (12)
- HuggingFaceFW/finewiki • 61.6M • 2.66k • 107
- HuggingFaceFW/clean-wikipedia • 61.2M • 1.91k • 22
- HuggingFaceFW/finepdfs_lang_classification_tmp • 4
- HuggingFaceFW/ocr-annotations • 1.62k • 106 • 11
- HuggingFaceFW/finepdfs_lang_classification • 3.08M • 762 • 3
- HuggingFaceFW/finepdfs • 475M • 47.6k • 625
- HuggingFaceFW/fineweb • 52.5B • 276k • 2.4k
- HuggingFaceFW/fineweb-edu • 3.5B • 387k • 783
- HuggingFaceFW/fineweb-edu-score-2 • 13.9B • 20k • 80
- HuggingFaceFW/fineweb-2 • 5.02B • 117k • 675