Papers
arxiv:2507.23268

PixNerd: Pixel Neural Field Diffusion

Published on Jul 31
· Submitted by wangsssssss on Aug 4
#2 Paper of the day
Authors:
,
,
,
,

Abstract

Pixel Neural Field Diffusion (PixNerd) achieves high-quality image generation in a single-scale, single-stage process without VAEs or complex pipelines, and extends to text-to-image applications with competitive performance.

AI-generated summary

The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder(VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion~(PixelNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256times256 and 2.84 FID on ImageNet 512times512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

Community

A new speedy pixel diffusion transformer with neural field!!

TL;DR: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder(VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion~(PixelNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

Online Space for text-to-image: https://huggingface.co/spaces/MCG-NJU/PixNerd

Revision of the inference time statistics

image.png

Model Inference Training
1 image 1 step Mem (GB) Speed (s/it) Mem (GB)
SiT-L/2(VAE-f8) 0.51s 0.0097s 2.9 0.30 18.4
Baseline-L/16 0.48s 0.0097s 2.1 0.18 18
PixNerd-L/16 0.51s 0.010s 2.1 0.19 22

Deeply sorry for this mistake, the single-step inference time of SiT-L/2 and Baseline-L is missing a zero (0.097s vs 0.0097s). The single-step inference time of PixNerd and Baseline is close.

Sign up or log in to comment

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.23268 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 3