Common Pile

AI & ML interests

None defined yet.

Recent Activity

soldni authored a paper 22 days ago

2 OLMo 2 Furious

soldni authored a paper 22 days ago

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

soldni authored a paper 22 days ago

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Articles

Announcing the Common Pile and Comma v0.1

Jun 6, 2025

soldni

authored 11 papers 22 days ago

authored a paper 2 months ago

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Paper • 2512.20757 • Published Dec 23, 2025 • 18

stellaathena

in common-pile/arxiv_abstracts_filtered 3 months ago

Add link to Common Pile paper and note the dataset is a component of the Common Pile v0.1.

#2 opened 9 months ago by nielsr

stellaathena

in common-pile/library_of_congress 3 months ago

Improve dataset card: Clarify name, add license and link to Github repo

#2 opened 9 months ago by nielsr

stellaathena

in common-pile/comma-v0.1-2t 4 months ago

add your new released models+datasets to your eleuther releases on the website <3

#2 opened 5 months ago by busssard

conceptofmind

in common-pile/caselaw_access_project 7 months ago

Set this up with ApertureDB Croissant ingestion and build RAG

#2 opened 7 months ago by vishakha041

baber

updated a dataset 7 months ago

common-pile/raw_v0.1_parquet

Viewer • Updated Jul 16, 2025 • 2.58B • 3.6k • 1

baber

published a dataset 8 months ago

common-pile/raw_v0.1_parquet

Viewer • Updated Jul 16, 2025 • 2.58B • 3.6k • 1

craffel

authored a paper 8 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26, 2025 • 77

conceptofmind

authored a paper 8 months ago

Bridging the Data Provenance Gap Across Text, Speech and Video

Paper • 2412.17847 • Published Dec 19, 2024 • 10

Team members 13