See `unidisc/datasets/preprocessing` for instructions on how to preprocess datasets.

We support the following datasets:
- Cambrian
- CapsFusion
- CC12M
- DataComp1B
- JourneyDB
- LAION400M
- MMC4
- PixelProse

Additionally, we generated our own synthetic dataset, available [on Hugging Face](https://huggingface.co/datasets/aswerdlow/unidisc_hq), and provide the [generation scripts](../unidisc/datasets/preprocessing/unidisc_dataset/README.md) as well as the raw data.
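
As a rough sketch, the synthetic dataset could be fetched locally with the `huggingface_hub` package. The repository ID is taken from the dataset URL; the helper name `download_unidisc_hq` and the default `data/unidisc_hq` directory are hypothetical choices for illustration, not part of this repository.

```python
from pathlib import Path

# Repository ID taken from the Hugging Face dataset URL.
REPO_ID = "aswerdlow/unidisc_hq"


def download_unidisc_hq(local_dir="data/unidisc_hq", allow_patterns=None):
    """Download the dataset repository (or a subset of its files) into local_dir.

    Hypothetical helper: the name and default directory are assumptions.
    """
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)

    # repo_type="dataset" is required because the repository is a dataset, not a model.
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=str(target),
        allow_patterns=allow_patterns,  # None downloads every file in the repository
    )
```

Calling `download_unidisc_hq()` returns the local path of the downloaded snapshot. Since the file layout of the dataset repository is not described here, it may be worth inspecting it on the dataset page before narrowing the download with `allow_patterns`.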