FineData

Team

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

guipenedo updated a dataset 7 days ago

HuggingFaceFW/fineweb-2

guipenedo new activity 7 days ago

HuggingFaceFW/fineweb-2:Synthetic Data Generator

guipenedo new activity 7 days ago

HuggingFaceFW/fineweb-2:Number of rows not available for all configs.

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

meg

posted an update 5 days ago

🤖 Did you know your voice might be cloned without your consent from just *one sentence* of audio?
That's not great. So with @frimelle, we brainstormed a new idea for developers who want to curb malicious use: ✨The Voice Consent Gate.✨
Details, code, here: https://huggingface.co/blog/voice-consent-gate

guipenedo

updated a dataset 7 days ago

HuggingFaceFW/fineweb-2

Viewer • Updated 7 days ago • 4.48B • 96.5k • 678

guipenedo

in HuggingFaceFW/fineweb-2 7 days ago

Synthetic Data Generator

#5 opened 10 months ago by

Number of rows not available for all configs.

#7 opened 6 months ago by

guipenedo

in HuggingFaceFW/finewiki 10 days ago

Filtered Cebuano?

#3 opened 12 days ago by

hynky

in HuggingFaceFW/finepdfs 12 days ago

OCR or not classifier

#6 opened about 2 months ago by

A Few Questions About the Implementation Details of the finepdfs Project

#24 opened 19 days ago by

guipenedo

in HuggingFaceFW/finewiki 12 days ago

docs: fix typo

#2 opened 13 days ago by

guipenedo

updated a dataset 13 days ago

HuggingFaceFW/clean-wikipedia

Viewer • Updated 13 days ago • 61.2M • 1.63k • 23

guipenedo

updated a Space 13 days ago

README

guipenedo

published a dataset 13 days ago

HuggingFaceFW/finewiki

Viewer • Updated 12 days ago • 61.6M • 12.8k • 197

guipenedo

published a Space 13 days ago

FineWiki Viewer

Viewer to explore the finewiki dataset

hynky

in HuggingFaceFW/finepdfs_lang_classification 13 days ago

datatrove

#2 opened 13 days ago by

hynky

published a dataset 13 days ago

HuggingFaceFW/finepdfs_lang_classification_tmp

Updated 13 days ago • 11

hynky

in HuggingFaceFW/finepdfs 14 days ago

Deciding on extraction path

#10 opened about 2 months ago by

Were the original PDFs saved?

#2 opened about 2 months ago by

Docling output

#4 opened about 2 months ago by

Can additional corpuses further train this model?

#13 opened about 2 months ago by

hynky

updated a dataset 14 days ago

HuggingFaceFW/ocr-annotations

Viewer • Updated 14 days ago • 1.62k • 225 • 13

guipenedo

updated a collection 14 days ago

📄 FinePDFs

78 items • Updated 14 days ago • 12