AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
🍷 FineData
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- 🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blogpost and paper.
- FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- 🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- FinePDFs: 3T tokens of text data extracted from PDFs sourced from the web.
- FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
- FinePDFs-Edu: 350B+ highly educational tokens filtered from FinePDFs.
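The datasets listed on this page are large, so a common way to inspect one is to stream a few records rather than download it in full. The following is a minimal sketch, not the team's documented workflow. It assumes the Hugging Face `datasets` library is installed, that the repository exposes a `train` split, and that `eng_Latn` is a valid configuration name for `HuggingFaceFW/finepdfs` (inferred from the naming of the classifiers on this page, not confirmed here). The `peek` helper and its `loader` parameter are hypothetical names introduced for illustration.

```python
from itertools import islice

# Repository ids as they appear on this page.
FINEDATA_REPOS = {
    "finepdfs": "HuggingFaceFW/finepdfs",
    "finepdfs-edu": "HuggingFaceFW/finepdfs-edu",
    "fineweb-2": "HuggingFaceFW/fineweb-2",
    "finewiki": "HuggingFaceFW/finewiki",
}


def peek(repo_id, config=None, n=3, loader=None):
    """Return the first `n` records of a dataset without a full download.

    `loader` is a hypothetical injection point: it defaults to
    `datasets.load_dataset`, and can be swapped for a stub when offline.
    """
    if loader is None:
        # Imported lazily so the helper can be used with a stub loader.
        from datasets import load_dataset  # pip install datasets
        loader = load_dataset
    # streaming=True yields records lazily instead of materializing the split.
    # The split name "train" is an assumption about the repository layout.
    stream = loader(repo_id, name=config, split="train", streaming=True)
    return list(islice(stream, n))


# Example call (needs network access; the config name is an assumption):
# rows = peek(FINEDATA_REPOS["finepdfs"], config="eng_Latn", n=2)
# print(sorted(rows[0].keys()))
```

Printing the keys of the first record, as in the commented call, is a quick way to discover the actual column names before relying on any of them.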
- HuggingFaceFW/finepdfs • Viewer • 476M • 60.7k • 674
- HuggingFaceFW/finepdfs-edu • Viewer • 49.5M • 10.8k • 42
- HuggingFaceFW/ocr-annotations • Viewer • 1.62k • 225 • 15
- HuggingFaceFW/finepdfs_lang_classification • Viewer • 3.08M • 23.3k • 4
Spaces (6)
- FineWiki Viewer (Running • 8): Viewer to explore the finewiki dataset
- FineWeb: decanting the web for the finest text data at scale (Running • Featured • 1.19k): Generate high-quality text data for LLMs using FineWeb
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks (Running • 81): Evaluate multilingual models using FineTasks
- Tasks Explorer (Build error): Explore and analyze experiment results
- Datasets Metrics Explorer (Runtime error • 4): Launch an interactive demo interface
Models (105)
- HuggingFaceFW/finepdfs_edu_classifier_eng_Latn • 0.4B • 32 • 2
- HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn • 0.4B • 32
- HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn • 0.4B • 22
- HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn • 0.4B • 12
- HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr • 0.3B • 14
- HuggingFaceFW/finepdfs_edu_classifier_nno_Latn • 0.3B • 10
- HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl • 0.3B • 10
- HuggingFaceFW/finepdfs_edu_classifier_tam_Taml • 0.3B • 11
- HuggingFaceFW/finepdfs_edu_classifier_azj_Latn • 0.3B • 7
- HuggingFaceFW/finepdfs_edu_classifier_afr_Latn • 0.3B • 12
Datasets (15)
- HuggingFaceFW/finepdfs • Viewer • 476M • 60.7k • 674
- HuggingFaceFW/finepdfs-edu • Viewer • 49.5M • 10.8k • 42
- HuggingFaceFW/fineweb-2 • Viewer • 4.48B • 86.5k • 691
- HuggingFaceFW/finewiki • Viewer • 61.6M • 27k • 260
- HuggingFaceFW/clean-wikipedia • Viewer • 61.2M • 1.29k • 23
- HuggingFaceFW/finepdfs_lang_classification_tmp • 25
- HuggingFaceFW/ocr-annotations • Viewer • 1.62k • 225 • 15
- HuggingFaceFW/finepdfs_lang_classification • Viewer • 3.08M • 23.3k • 4
- HuggingFaceFW/finepdfs_eng_Latn_labeled • Viewer • 1.3M • 213 • 2
- HuggingFaceFW/finepdfs_fw_edu_labeled • Viewer • 18.8M • 255 • 3