See `unidisc/datasets/preprocessing` for instructions on how to preprocess datasets. We support the following datasets:

- Cambrian
- CapsFusion
- CC12M
- DataComp1B
- JourneyDB
- LAION400M
- MMC4
- PixelProse

Additionally, we generated our own synthetic dataset available [here](https://huggingface.co/datasets/aswerdlow/unidisc_hq) and provide the [generation scripts](../unidisc/datasets/preprocessing/unidisc_dataset/README.md) as well as the raw data.