Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts Viewer • Updated May 31, 2024 • 4.16M • 279
Tristan/RedPajama-Data-V2-sample-100B-filtered-for-regression-domains-with-domains Viewer • Updated May 24, 2024 • 4.16M • 159
Tristan/olm-wikipedia-20221220-1-percent-tokenized-766 Viewer • Updated Jan 18, 2023 • 65.1k • 6
Tristan/olm-wikipedia-20221220-1-percent-tokenized-568 Viewer • Updated Jan 17, 2023 • 87.8k • 7
Tristan/t5-small-october-wikipedia-2022-tokenized-512 Viewer • Updated Jan 2, 2023 • 9.74M • 104
Tristan/wikipedia-august-october-line-diff-1000-char-threshold-1000-sample Viewer • Updated Dec 13, 2022 • 1k • 13
Tristan/wikipedia-august-october-line-diff-1000-char-threshold Viewer • Updated Dec 13, 2022 • 286k • 10
Tristan/olm-october-2022-tokenized-1024-suffix-array-dedup Viewer • Updated Dec 11, 2022 • 13.2M • 7
Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-suffix-array-dedup Viewer • Updated Dec 10, 2022 • 7.52M • 95
Tristan/olm-october-2022-tokenized-1024-perplexity-filters Viewer • Updated Dec 9, 2022 • 12.8M • 8
Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-perplexity-filters Viewer • Updated Dec 8, 2022 • 14.6M • 12
Tristan/olm-october-2022-with-bookcorpus-tokenized-1024 Viewer • Updated Dec 7, 2022 • 14.3M • 13
Tristan/olm-october-2022-tokenized-1024-no-bigscience-filters Viewer • Updated Dec 7, 2022 • 12.9M • 11