Index for olmo-mix-1124
Thank you very much for this helpful tool! We were wondering if you might consider adding an index for olmo-mix-1124 (https://huggingface.co/datasets/allenai/olmo-mix-1124). We’re particularly interested in using Infini-gram to search this dataset. If adding support isn’t possible, we would also appreciate any alternative suggestions you might have.
Hi, apologies for the super belated reply! The dataset you asked for has been indexed. You can download it from s3://infini-gram/index/v4_olmoe-mix-0924-dclm_llama and s3://infini-gram/index/v4_olmoe-mix-0924-nodclm_llama, and then serve it locally with the Python package. The index names say olmoe-mix-0924 because olmoe-mix-0924 is identical to olmo-mix-1124.
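For anyone following along, here is a rough sketch of the download-then-serve-locally flow. The two S3 paths are the ones from the reply. Everything else is an assumption on my part: the helper names (`download_command`, `load_engine`, `count_phrase`) are made up for illustration, the local `index/` directory is arbitrary, the `aws s3 cp` call may need extra flags depending on how the bucket is configured, and `eos_token_id=2` assumes the `_llama` suffix means the Llama-2 tokenizer.

```python
# Index locations quoted in the reply above.
INDEX_URLS = [
    "s3://infini-gram/index/v4_olmoe-mix-0924-dclm_llama",
    "s3://infini-gram/index/v4_olmoe-mix-0924-nodclm_llama",
]


def download_command(s3_url, local_root="index"):
    """Build an `aws s3 cp` command that copies one index under local_root.

    Assumption: depending on the bucket settings you may need to append
    flags such as --no-sign-request or --request-payer requester.
    """
    name = s3_url.rstrip("/").rsplit("/", 1)[-1]
    return ["aws", "s3", "cp", "--recursive", s3_url, f"{local_root}/{name}"]


def load_engine(index_dir, eos_token_id=2):
    """Open a downloaded index with the infini-gram Python package.

    Requires `pip install infini-gram`. eos_token_id=2 is an assumption
    that matches the Llama-2 tokenizer.
    """
    from infini_gram.engine import InfiniGramEngine  # lazy import

    return InfiniGramEngine(index_dir=index_dir, eos_token_id=eos_token_id)


def count_phrase(engine, tokenizer, phrase):
    """Count occurrences of a phrase; tokenizer must match the index."""
    input_ids = tokenizer.encode(phrase)
    return engine.count(input_ids=input_ids)


if __name__ == "__main__":
    # Only prints the commands; run them yourself once you have checked
    # the disk space needed for each index.
    for url in INDEX_URLS:
        print(" ".join(download_command(url)))
```

The tokenizer passed to `count_phrase` should be the same one the index was built with and should not add BOS/EOS tokens, otherwise the counts will not line up. Since the data is split across two indexes (DCLM and non-DCLM), you would presumably need to query both and combine the results.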
I also put up (in the web interface and API endpoint) a few indexes for OLMo 2 training data. These indexes are mostly olmo-mix-1124 but also include the mid-training and post-training data of OLMo 2 models. Just mentioning this in case they are helpful.
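If the hosted indexes are enough, a count query against the API endpoint could look roughly like the sketch below. The helper names are hypothetical, and the index name is deliberately left as a placeholder since the exact names of the OLMo 2 indexes are not given here; check the web interface or the API documentation for the current list.

```python
import json
import urllib.request

API_URL = "https://api.infini-gram.io/"


def build_count_payload(index, query):
    """Build the JSON body for a count query against one hosted index."""
    return {"index": index, "query_type": "count", "query": query}


def post_query(payload, url=API_URL, timeout=30):
    """POST the payload to the API endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # "<index-name>" is a placeholder, not a real index; replace it with
    # one of the OLMo 2 index names listed for the API.
    payload = build_count_payload("<index-name>", "natural language processing")
    print(json.dumps(payload, indent=2))
```

Call `post_query(payload)` once the placeholder has been replaced with a real index name.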