PixelFlow: Pixel-Space Generative Models with Flow
Abstract
We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow achieves affordable computation cost in pixel space. It achieves an FID of 1.98 on the 256×256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.
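The abstract does not spell out how the cascade is formulated, so the following is only a generic illustration of flow matching between two pixel resolutions, not PixelFlow's actual implementation. It assumes a straight-line (rectified-flow style) path from an upsampled low-resolution image to a higher-resolution target, and every function name here (`upsample_nearest`, `flow_matching_pair`, `euler_sample`) is hypothetical.

```python
import numpy as np


def upsample_nearest(img, factor=2):
    """Nearest-neighbor upsampling of an (H, W, C) array by an integer factor."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)


def flow_matching_pair(x_start, x_end, t):
    """Return a point on the straight path between x_start and x_end,
    plus the constant velocity a model would be trained to predict there.

    Assumption: a linear interpolation path; the paper may use a different one.
    """
    x_t = (1.0 - t) * x_start + t * x_end
    velocity = x_end - x_start
    return x_t, velocity


def euler_sample(x_start, velocity_fn, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x_start.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_res = rng.standard_normal((4, 4, 3))    # stand-in for a coarse stage
    high_res = rng.standard_normal((8, 8, 3))   # stand-in for the next stage's target

    # Each stage starts from the previous stage's output, upsampled in pixel space.
    start = upsample_nearest(low_res, factor=2)
    x_t, v = flow_matching_pair(start, high_res, t=0.5)

    # With the exact (oracle) velocity, Euler integration reaches the target.
    out = euler_sample(start, lambda x, t: high_res - start, num_steps=10)
    print(x_t.shape, np.allclose(out, high_res))
```

In a real model the oracle velocity would be replaced by a trained network, and most integration steps could run at the cheaper low resolutions, which is one plausible reading of how a cascade keeps pixel-space computation affordable.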
Community
Have you taken any inspiration from "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (https://arxiv.org/abs/2404.02905)?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FlowTok: Flowing Seamlessly Across Text and Image Tokens (2025)
- ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration (2025)
- Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens (2025)
- USP: Unified Self-Supervised Pretraining for Image Generation and Understanding (2025)
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025)
- Deeply Supervised Flow-Based Generative Models (2025)
- Controlling Latent Diffusion Using Latent CLIP (2025)
I do like the approach. Finding paths not only between same-resolution distributions but also integrating upscaling into this sounds quite elegant. In my eyes, though, the computational overhead from this will make scaling difficult, right?
Sounds like Next-level de-convolutions, congrats! Add ControlNets into the mix - and it will be a killer model for general use, imho