Image-to-Image · Diffusers
xingjianleng committed on
Commit a3dbedc · verified · 1 Parent(s): 64b11ed

Update README.md

Files changed (1)
  1. README.md +46 -17
README.md CHANGED
@@ -4,28 +4,57 @@ library_name: diffusers
  pipeline_tag: image-to-image
  ---
 
- ---
- license: mit
- library_name: diffusers
- pipeline_tag: image-to-image
- ---
-
- # REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
-
- ## About
- This model addresses the question of whether latent diffusion models and their VAE tokenizer can be trained end-to-end. Using a representation-alignment (REPA) loss, REPA-E enables stable and effective joint training of both components, leading to significant training acceleration and improved VAE performance. The resulting E2E-VAE serves as a drop-in replacement for existing VAEs, improving convergence and generation quality across diverse LDM architectures.
-
- This model is based on the paper [REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers](https://huggingface.co/papers/2504.10483) and its official implementation is available on [Github](https://github.com/REPA-E/REPA-E). The project page can be found at [https://end2end-diffusion.github.io](https://end2end-diffusion.github.io).
-
- ## Usage
-
- To use the REPA-E model, you can load it via the Hugging Face `DiffusionPipeline`. Below is a simplified example of how to use a pretrained REPA-E model for inference. For training examples and further details, please refer to the [Github repository](https://github.com/REPA-E/REPA-E).
-
- ```python
- from diffusers import DiffusionPipeline
-
- pipeline = DiffusionPipeline.from_pretrained("REPA-E/sit-repae-sdvae", trust_remote_code=True)
- image = pipeline().images[0]
-
- image.save("generated_image.png")
- ```
+ <h1 align="center"> REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers </h1>
+
+ <p align="center">
+ <a href="https://scholar.google.com.au/citations?user=GQzvqS4AAAAJ" target="_blank">Xingjian&nbsp;Leng</a><sup>1*</sup> &ensp; <b>&middot;</b> &ensp;
+ <a href="https://1jsingh.github.io/" target="_blank">Jaskirat&nbsp;Singh</a><sup>1*</sup> &ensp; <b>&middot;</b> &ensp;
+ <a href="https://hou-yz.github.io/" target="_blank">Yunzhong&nbsp;Hou</a><sup>1</sup> &ensp; <b>&middot;</b> &ensp;
+ <a href="https://people.csiro.au/X/Z/Zhenchang-Xing/" target="_blank">Zhenchang&nbsp;Xing</a><sup>2</sup> &ensp; <b>&middot;</b> &ensp;
+ <a href="https://www.sainingxie.com/" target="_blank">Saining&nbsp;Xie</a><sup>3</sup> &ensp; <b>&middot;</b> &ensp;
+ <a href="https://zheng-lab-anu.github.io/" target="_blank">Liang&nbsp;Zheng</a><sup>1</sup>
+ </p>
+
+ <p align="center">
+ <sup>1</sup> Australian National University &emsp; <sup>2</sup> Data61-CSIRO &emsp; <sup>3</sup> New York University <br>
+ <sub><sup>*</sup> Project Leads</sub>
+ </p>
+
+ <p align="center">
+ <a href="https://End2End-Diffusion.github.io">🌐 Project Page</a> &ensp;
+ <a href="https://huggingface.co/REPA-E">🤗 Models</a> &ensp;
+ <a href="https://arxiv.org/abs/2504.10483">📃 Paper</a> &ensp;
+ <br>
+ <a href="https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=repa-e-unlocking-vae-for-end-to-end-tuning-of"><img src="https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/repa-e-unlocking-vae-for-end-to-end-tuning-of/image-generation-on-imagenet-256x256" alt="PWC"></a>
+ </p>
+
+ <p align="center">
+ <img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/vis-examples.jpg" width="100%" alt="teaser">
+ </p>
+
+ ---
+
+ We address a fundamental question: ***Can latent diffusion models and their VAE tokenizer be trained end-to-end?*** While jointly training both components with the standard diffusion loss is observed to be ineffective, often degrading final performance, we show that this limitation can be overcome with a simple representation-alignment (REPA) loss. Our proposed method, **REPA-E**, enables stable and effective joint training of both the VAE and the diffusion model.
+
+ <p align="center">
+ <img src="https://github.com/End2End-Diffusion/REPA-E/raw/main/assets/overview.jpg" width="100%" alt="overview">
+ </p>
 
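+ For intuition, the alignment term can be read as maximizing patch-wise cosine similarity between projected intermediate features of the diffusion transformer and features of the clean image from a frozen pretrained vision encoder (e.g., DINOv2). The snippet below is only a schematic sketch of such a loss; the projector, feature shapes, and weighting are illustrative assumptions, and the reference implementation lives in the GitHub repo.
+
+ ```python
+ import torch.nn.functional as F
+
+ def alignment_loss(diffusion_feats, encoder_feats, projector):
+     """Schematic REPA-style representation-alignment loss (not the official code).
+
+     diffusion_feats: (B, N, C) intermediate tokens from the diffusion transformer
+     encoder_feats:   (B, N, D) patch features of the clean image from a frozen encoder
+     projector:       trainable MLP mapping C -> D (illustrative)
+     """
+     z = projector(diffusion_feats)                        # (B, N, D)
+     sim = F.cosine_similarity(z, encoder_feats, dim=-1)   # (B, N)
+     return -sim.mean()                                    # higher similarity -> lower loss
+ ```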
+ **REPA-E** significantly accelerates training, achieving over a **17×** speedup compared to REPA and **45×** over the vanilla training recipe. Interestingly, end-to-end tuning also improves the VAE itself: the resulting **E2E-VAE** provides better latent structure and serves as a **drop-in replacement** for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures. Our method achieves state-of-the-art FID scores on ImageNet 256×256: **1.26** with CFG and **1.83** without CFG.
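+ To illustrate what "drop-in replacement" means in practice, the sketch below assumes the E2E-VAE is published as a standard `diffusers` `AutoencoderKL` checkpoint; the repository id is a placeholder, so check the [model collection](https://huggingface.co/REPA-E) for the actual checkpoint name.
+
+ ```python
+ import torch
+ from diffusers import AutoencoderKL
+
+ # Placeholder repo id: substitute the actual E2E-VAE checkpoint from https://huggingface.co/REPA-E
+ vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vae")
+
+ # Encode to latents and decode back, exactly as you would with SD-VAE
+ images = torch.randn(1, 3, 256, 256)                # dummy batch; real inputs should be scaled to [-1, 1]
+ latents = vae.encode(images).latent_dist.sample()   # e.g. (1, 4, 32, 32) for an SD-VAE-like config
+ recon = vae.decode(latents).sample
+ ```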
+
+ ## Usage and Training
+
+ Please refer to our [GitHub repo](https://github.com/End2End-Diffusion/REPA-E) for detailed notes on end-to-end training and inference with REPA-E.
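+ For quick reference, a minimal inference sketch is shown below. It assumes the checkpoint exposes a custom pipeline via `trust_remote_code=True` (as in an earlier revision of this card); exact generation arguments may differ, so treat the GitHub repo as authoritative.
+
+ ```python
+ import torch
+ from diffusers import DiffusionPipeline
+
+ # Checkpoint name carried over from the previous revision of this card; adjust if needed.
+ pipeline = DiffusionPipeline.from_pretrained("REPA-E/sit-repae-sdvae", trust_remote_code=True)
+ pipeline = pipeline.to("cuda" if torch.cuda.is_available() else "cpu")
+
+ image = pipeline().images[0]
+ image.save("generated_image.png")
+ ```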
 
+ ## 📚 Citation
+
+ ```bibtex
+ @article{leng2025repae,
+   title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
+   author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
+   year={2025},
+   journal={arXiv preprint arXiv:2504.10483},
+ }
+ ```