pietrolesci/pile-deduped
Note: Includes both the raw (`/data`) and tokenised (`/tokenized`) versions of the deduplicated Pile. See README.md for more details.
Note: Includes the tokenised and packed data in the exact order in which it was seen by the Pythia "dedup" models. Some documents are repeated (though in a different order) because the "dedup" models were trained for ~1.5 epochs, so that their total token count matches that of the "non-dedup" version.
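
The ~1.5 epoch figure follows from simple arithmetic on the token budget. The sketch below is only illustrative: the two token counts are approximate values assumed from the figures commonly reported for Pythia (about 300B training tokens, about 207B tokens in the deduplicated Pile) and are not stated on this page; `epochs_needed` is a hypothetical helper.

```python
# Approximate token counts; assumptions for illustration, not taken from this page.
TRAINING_TOKENS = 300e9    # tokens seen by each Pythia model during training (approx.)
DEDUP_PILE_TOKENS = 207e9  # tokens in the deduplicated Pile (approx.)


def epochs_needed(training_tokens: float, dataset_tokens: float) -> float:
    """Return how many passes over the dataset are needed to reach the token budget."""
    return training_tokens / dataset_tokens


epochs = epochs_needed(TRAINING_TOKENS, DEDUP_PILE_TOKENS)
print(f"{epochs:.2f} epochs")  # roughly one and a half passes over the data
```

Because the budget exceeds one full pass, roughly the first half of the data is seen a second time, which is why some documents appear twice in the packed sequence.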