Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts • Updated May 31, 2024 • 4.16M • 279
Tristan/RedPajama-Data-V2-sample-100B-filtered-for-regression-domains-with-domains • Updated May 24, 2024 • 4.16M • 159
Tristan/wikipedia-august-october-line-diff-1000-char-threshold-1000-sample • Updated Dec 13, 2022 • 1k • 13
Tristan/wikipedia-august-october-line-diff-1000-char-threshold • Updated Dec 13, 2022 • 286k • 10
Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-suffix-array-dedup • Updated Dec 10, 2022 • 7.52M • 95
Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-perplexity-filters • Updated Dec 8, 2022 • 14.6M • 12
Tristan/olm-october-2022-tokenized-1024-no-bigscience-filters • Updated Dec 7, 2022 • 12.9M • 11