---
datasets:
- ILSVRC/imagenet-1k
pipeline_tag: unconditional-image-generation
library_name: fairseq
---

<h1 align="center"> Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
</h1>

<div align="center">

[arXiv](https://arxiv.org/abs/2410.12490)
[Papers with Code](https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?tag_filter=485&p=stabilize-the-latent-space-for-image)

</div>

This is the official implementation of DiGIT [(GitHub)](https://github.com/DAMO-NLP-SG/DiGIT), accepted at NeurIPS 2024. The code will be available soon.

## Overview

We present **DiGIT**, an auto-regressive generative model that performs next-token prediction in an abstract latent space derived from self-supervised learning (SSL) models. By applying K-Means clustering to the hidden states of the DINOv2 model, we create a novel discrete tokenizer. This method significantly boosts image generation performance on the ImageNet dataset, achieving an FID score of 4.59 for class-unconditional generation and 3.39 for class-conditional generation. Additionally, the model enhances image understanding, attaining a linear-probe accuracy of 80.3.

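To make the tokenizer concrete, here is a minimal sketch of how an image can be mapped to discrete SSL tokens: extract DINOv2 patch features and assign each patch to its nearest K-Means centroid. The file path, feature layer, and preprocessing below are illustrative assumptions, not the exact recipe used in the paper.

```python
# Sketch: discretize an image into SSL tokens by assigning each DINOv2 patch
# feature to its nearest K-Means centroid. Paths, the chosen feature layer,
# and preprocessing are illustrative assumptions, not the official pipeline.
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen DINOv2 ViT-B/14 encoder from torch.hub.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# Precomputed K-Means centroids (see the Checkpoints section), shape (K, dim).
centroids = torch.from_numpy(
    np.load("data/dinov2_base_short_224_l3/km_8k.npy")
).float().to(device)

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    # Patch-level hidden states: (1, 16*16, dim) for a 224x224 input (patch size 14).
    feats = encoder.get_intermediate_layers(image, n=1)[0]

# Nearest-centroid assignment yields a 16x16 grid of discrete token ids.
token_ids = torch.cdist(feats.squeeze(0), centroids).argmin(dim=-1).reshape(16, 16)
print(token_ids)
```

The resulting 16 $\times$ 16 grid of token ids is what the generative model is trained to predict autoregressively, token by token.
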
## Experimental Results

### Linear-Probe Accuracy on ImageNet

| Methods | \# Tokens | Features | \# Params | Top-1 Acc. $\uparrow$ |
|-----------------------------------|-------------|----------|------------|-----------------------|
| iGPT-L | 32 $\times$ 32 | 1536 | 1362M | 60.3 |
| iGPT-XL | 64 $\times$ 64 | 3072 | 6801M | 68.7 |
| VIM+VQGAN | 32 $\times$ 32 | 1024 | 650M | 61.8 |
| VIM+dVAE | 32 $\times$ 32 | 1024 | 650M | 63.8 |
| VIM+ViT-VQGAN | 32 $\times$ 32 | 1024 | 650M | 65.1 |
| VIM+ViT-VQGAN | 32 $\times$ 32 | 2048 | 1697M | 73.2 |
| AIM | 16 $\times$ 16 | 1536 | 0.6B | 70.5 |
| **DiGIT (Ours)** | 16 $\times$ 16 | 1024 | 219M | 71.7 |
| **DiGIT (Ours)** | 16 $\times$ 16 | 1536 | 732M | **80.3** |

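Linear-probe accuracy follows the usual protocol: the backbone is kept frozen and only a linear classifier is trained on its pooled features. A minimal sketch of that protocol is below; the feature files and solver settings are hypothetical, not the paper's exact setup.

```python
# Sketch of a standard linear probe: fit a linear classifier on frozen features.
# The .npy paths are hypothetical placeholders for precomputed, pooled per-image
# features and their ImageNet labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

train_feats = np.load("features/train_feats.npy")    # (N_train, dim)
train_labels = np.load("features/train_labels.npy")  # (N_train,)
val_feats = np.load("features/val_feats.npy")
val_labels = np.load("features/val_labels.npy")

probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)

top1 = probe.score(val_feats, val_labels)
print(f"linear-probe top-1 accuracy: {top1:.3f}")
```
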
### Class-Unconditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

| Type | Methods | \# Param | \# Epoch | FID $\downarrow$ | IS $\uparrow$ |
|-------|-------------------------------------|----------|----------|------------------|----------------|
| GAN | BigGAN | 70M | - | 38.6 | 24.70 |
| Diff. | LDM | 395M | - | 39.1 | 22.83 |
| Diff. | ADM | 554M | - | 26.2 | 39.70 |
| MIM | MAGE | 200M | 1600 | 11.1 | 81.17 |
| MIM | MAGE | 463M | 1600 | 9.10 | 105.1 |
| MIM | MaskGIT | 227M | 300 | 20.7 | 42.08 |
| MIM | **DiGIT (+MaskGIT)** | 219M | 200 | **9.04** | **75.04** |
| AR | VQGAN | 214M | 200 | 24.38 | 30.93 |
| AR | **DiGIT (+VQGAN)** | 219M | 400 | **9.13** | **73.85** |
| AR | **DiGIT (+VQGAN)** | 732M | 200 | **4.59** | **141.29** |

### Class-Conditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

| Type | Methods | \# Param | \# Epoch | FID $\downarrow$ | IS $\uparrow$ |
|-------|----------------------|----------|----------|------------------|----------------|
| GAN | BigGAN | 160M | - | 6.95 | 198.2 |
| Diff. | ADM | 554M | - | 10.94 | 101.0 |
| Diff. | LDM-4 | 400M | - | 10.56 | 103.5 |
| Diff. | DiT-XL/2 | 675M | - | 9.62 | 121.50 |
| Diff. | L-DiT-7B | 7B | - | 6.09 | 153.32 |
| MIM | CQR-Trans | 371M | 300 | 5.45 | 172.6 |
| MIM+AR | VAR | 310M | 200 | 4.64 | - |
| MIM+AR | VAR | 310M | 200 | 3.60* | 257.5* |
| MIM+AR | VAR | 600M | 250 | 2.95* | 306.1* |
| MIM | MAGVIT-v2 | 307M | 1080 | 3.65 | 200.5 |
| AR | VQVAE-2 | 13.5B | - | 31.11 | 45 |
| AR | RQ-Trans | 480M | - | 15.72 | 86.8 |
| AR | RQ-Trans | 3.8B | - | 7.55 | 134.0 |
| AR | ViTVQGAN | 650M | 360 | 11.20 | 97.2 |
| AR | ViTVQGAN | 1.7B | 360 | 5.3 | 149.9 |
| MIM | MaskGIT | 227M | 300 | 6.18 | 182.1 |
| MIM | **DiGIT (+MaskGIT)** | 219M | 200 | **4.62** | **146.19** |
| AR | VQGAN | 227M | 300 | 18.65 | 80.4 |
| AR | **DiGIT (+VQGAN)** | 219M | 200 | **4.79** | **142.87** |
| AR | **DiGIT (+VQGAN)** | 732M | 200 | **3.39** | **205.96** |

*: VAR is trained with classifier-free guidance while all the other models are not.

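FID (lower is better) and IS (higher is better) are the standard ImageNet generation metrics. As a rough illustration of how such numbers are computed, here is a sketch using `torchmetrics`; it is not the evaluation pipeline behind the tables above, and the placeholder batches below are random tensors.

```python
# Sketch: measuring FID and Inception Score with torchmetrics.
# Requires: pip install "torchmetrics[image]"
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

fid = FrechetInceptionDistance(feature=2048)
inception = InceptionScore()

# Placeholder batches: in practice these would be real ImageNet images and
# samples decoded from the generative model, as uint8 tensors (B, 3, 256, 256).
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
inception.update(fake_images)

print("FID:", fid.compute().item())
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```
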
## Checkpoints

The K-Means npy files and model checkpoints can be downloaded from:

| Model | Link |
|:----------:|:-----:|
| HF weights 🤗 | [Huggingface](https://huggingface.co/DAMO-NLP-SG/DiGIT) |
| Google Drive | [Google Drive](https://drive.google.com/drive/folders/1QWc51HhnZ2G4xI7TkKRanaqXuo8WxUSI?usp=share_link) |

For the base-size model we use [DINOv2-base](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth), and for the large-size model we use [DINOv2-large](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_reg4_pretrain.pth). The VQGAN we use is the same as in [MAGE](https://drive.google.com/file/d/13S_unB87n6KKuuMdyMnyExW0G1kplTbP/view?usp=sharing).

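For example, the Hugging Face weights can be fetched with `huggingface_hub` (a minimal sketch; the local directory name is illustrative and the files should be arranged to match the layout shown below):

```python
# Sketch: download the DiGIT checkpoints and K-Means npy files from the Hugging Face Hub.
from huggingface_hub import snapshot_download

# Downloads the whole model repo into a local folder (folder name is illustrative).
snapshot_download(repo_id="DAMO-NLP-SG/DiGIT", local_dir="DiGIT")
```
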
```
DiGIT
├── data/
│   ├── ILSVRC2012
│   ├── dinov2_base_short_224_l3
│   │   └── km_8k.npy
│   ├── dinov2_large_short_224_l3
│   │   └── km_16k.npy
├── outputs/
│   ├── base_8k_stage1
│   ├── ...
├── models/
│   ├── vqgan_jax_strongaug.ckpt
│   ├── dinov2_vitb14_reg4_pretrain.pth
│   ├── dinov2_vitl14_reg4_pretrain.pth
```

The training and inference code can be found in our GitHub repo: https://github.com/DAMO-NLP-SG/DiGIT.

## Citation

If you find our project useful, we hope you will star our repo and cite our work as follows:

```bibtex
@misc{zhu2024stabilizelatentspaceimage,
      title={Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective},
      author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
      year={2024},
      eprint={2410.12490},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.12490},
}
```