arxiv:2504.07963

PixelFlow: Pixel-Space Generative Models with Flow

Published on Apr 10

Abstract

We present PixelFlow, a family of image generation models that operate directly in raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow achieves an affordable computation cost in pixel space. It achieves an FID of 1.98 on the 256×256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.
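The abstract does not spell out how the cascade is built, so the following is only a rough sketch of the general idea of a flow path that crosses resolutions, not the authors' implementation. It assumes a straight-line path between a noised, upsampled low-resolution image and a higher-resolution target, with the constant velocity `end - start` as the regression target. All function names, the nearest-neighbour upsampling, and the `noise_scale` parameter are hypothetical choices made for illustration.

```python
import random


def upsample_nearest(img, factor=2):
    """Nearest-neighbour upsampling of a 2D list of floats."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def interpolate(start, end, tau):
    """Point on the straight path from start (tau=0) to end (tau=1)."""
    return [[(1.0 - tau) * s + tau * e for s, e in zip(rs, re)]
            for rs, re in zip(start, end)]


def velocity_target(start, end):
    """Constant velocity of the straight path: end - start."""
    return [[e - s for s, e in zip(rs, re)] for rs, re in zip(start, end)]


def make_stage_pair(low_res, high_res, noise_scale=0.5, rng=random):
    """Hypothetical stage: start is a noised upsampled low-res image,
    end is the high-res image (an assumption, not the paper's recipe)."""
    up = upsample_nearest(low_res, 2)
    start = [[v + noise_scale * rng.gauss(0.0, 1.0) for v in row] for row in up]
    return start, high_res


# Toy data: a 2x2 "image" and a 4x4 stand-in for its high-res version.
low = [[0.0, 1.0], [1.0, 0.0]]
high = upsample_nearest(low, 2)
start, end = make_stage_pair(low, high, rng=random.Random(0))
x_tau = interpolate(start, end, 0.25)   # sample on the path at tau = 0.25
v = velocity_target(start, end)         # what a network would regress to
```

In a setup like this, a network would be trained to predict `v` from `x_tau` and `tau`, and sampling would integrate the predicted velocity from `start` towards `end` before handing the result to the next, higher-resolution stage.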

Community

ShoufaChen

Paper submitter 1 day ago

PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models.

Danielmomen

1 day ago · edited 1 day ago

Have you taken any inspiration from "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (https://arxiv.org/abs/2404.02905)?

duinamit

about 9 hours ago

I do like the approach. Finding paths not only between same-resolution distributions but also integrating upscaling into them sounds quite elegant. Though in my eyes, the computational overhead from this will make scaling difficult, right?