The Pile Companion
Note: Includes both the raw (`/data`) and tokenised (`/tokenized`) versions of the deduplicated Pile. See README.md for more details.
pietrolesci/pile-deduped-pythia-preshuffled
Note: Includes the tokenised and packed data in the exact order in which they were seen by the Pythia "dedup" models. Some documents are repeated (though in a different order) because the "dedup" models are trained for ~1.5 epochs so that their token count matches that of the "non-dedup" version.
co-evolve/pile-deduped-pythia-tokfreq
Note: Includes a simple token count. Since the preshuffled data is now available in parquet format (`pietrolesci/pile-deduped-pythia-preshuffled`), recreating this dataset only requires a simple DuckDB SQL query, which runs in a few minutes locally. See the README.md for more.
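As a rough illustration of the token count described above, here is a minimal sketch. It is not the query from the README. The column name `input_ids` and the local path `data/*.parquet` are assumptions about how the preshuffled parquet files are laid out, so check the actual schema before running the SQL. The pure-Python function computes the same quantity on in-memory sequences and can be used to sanity-check a small sample.

```python
from collections import Counter
from typing import Iterable

# Hypothetical DuckDB query: `input_ids` (a list of token ids per packed
# sequence) and the 'data/*.parquet' glob are assumed, not taken from the README.
DUCKDB_QUERY = """
SELECT token_id, count(*) AS frequency
FROM (
    SELECT unnest(input_ids) AS token_id
    FROM read_parquet('data/*.parquet')
)
GROUP BY token_id
ORDER BY frequency DESC
"""


def token_frequencies(sequences: Iterable[Iterable[int]]) -> Counter:
    """Count how often each token id occurs across all packed sequences."""
    counts: Counter = Counter()
    for sequence in sequences:
        counts.update(sequence)
    return counts


if __name__ == "__main__":
    # Two toy "packed sequences" standing in for rows of the parquet files.
    toy_sequences = [[5, 7, 7], [7, 5, 2]]
    for token_id, frequency in token_frequencies(toy_sequences).most_common():
        print(token_id, frequency)
```

Because the "dedup" models see some documents more than once, counts computed over the preshuffled data reflect what the models were trained on rather than the frequencies in a single pass over the deduplicated corpus.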
pietrolesci/pile-validation
Note: Includes the validation set for the Pile. The Pythia models did not see this split during training, so it can be used for evaluation.