See `unidisc/datasets/preprocessing` for instructions on how to preprocess datasets.
We support the following datasets:
- Cambrian
- CapsFusion
- CC12M
- DataComp1B
- JourneyDB
- LAION400M
- MMC4
- PixelProse
Additionally, we generated our own synthetic dataset, available here, and provide the generation scripts as well as the raw data.