Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Abstract
Speculative Jacobi-Denoising Decoding accelerates autoregressive text-to-image generation by enabling parallel token prediction and reducing model forward passes.
As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens along the denoising trajectory for the next iteration. Experiments show that our method accelerates generation by reducing model forward passes while maintaining the visual quality of generated images.
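The abstract describes the inference loop only at a high level; the sketch below illustrates one way such a loop could be organised in PyTorch. All names and interfaces (`model`, `embed`, the draft window size, and the ratio-based acceptance test) are assumptions made for illustration, not the paper's implementation. In particular, the refinement step here simply swaps noisy embeddings for the predicted clean-token embeddings, whereas the paper refines unaccepted tokens along a denoising trajectory.

```python
import torch

@torch.no_grad()
def sjd2_generate(model, embed, prompt_ids, num_new_tokens, window=16):
    """Illustrative sketch of a speculative Jacobi-denoising decoding loop.

    Assumed interfaces (not taken from the paper):
      model(inputs_embeds) -> logits [1, L, vocab], where position t predicts token t+1
      embed(ids)           -> token embeddings [1, L, hidden] (an nn.Embedding)
    """
    device = prompt_ids.device
    hidden = embed.weight.shape[1]

    accepted = prompt_ids.clone()                         # committed tokens (prompt + accepted)
    # Draft window initialised with Gaussian noise in embedding space.
    draft_embeds = torch.randn(1, window, hidden, device=device)
    draft_ids = torch.zeros(window, dtype=torch.long, device=device)
    draft_probs = torch.full((window,), -1.0, device=device)   # -1 = nothing to verify yet

    while accepted.numel() - prompt_ids.numel() < num_new_tokens:
        # One parallel forward pass over [clean committed tokens | noisy draft embeddings].
        inputs = torch.cat([embed(accepted.unsqueeze(0)), draft_embeds], dim=1)
        probs = model(inputs_embeds=inputs)[0, -window - 1:-1].softmax(-1)   # [window, vocab]
        new_ids = torch.multinomial(probs, 1).squeeze(-1)                    # next-clean-token samples
        new_probs = probs.gather(1, new_ids.unsqueeze(1)).squeeze(1)

        # Probabilistic verification (a speculative-sampling-style ratio test is assumed here):
        # accept the longest prefix of previous draft tokens consistent with the new probabilities.
        n_accept = 0
        for i in range(window):
            if draft_probs[i] < 0:
                break
            ratio = (probs[i, draft_ids[i]] / draft_probs[i]).clamp(max=1.0)
            if torch.rand((), device=device) >= ratio:
                break
            n_accept += 1

        # Commit the verified draft tokens plus one freshly sampled token at the first
        # unverified position, so at least one token is accepted per forward pass.
        commit = torch.cat([draft_ids[:n_accept], new_ids[n_accept:n_accept + 1]])
        accepted = torch.cat([accepted, commit])
        shift = commit.numel()

        # Refine the unaccepted tail for the next Jacobi iteration: here the noisy embeddings
        # are replaced by embeddings of the newly predicted clean tokens, and the positions
        # that shifted into the window are re-initialised with fresh Gaussian noise.
        draft_embeds = torch.cat([embed(new_ids[shift:].unsqueeze(0)),
                                  torch.randn(1, shift, hidden, device=device)], dim=1)
        draft_ids = torch.cat([new_ids[shift:],
                               torch.zeros(shift, dtype=torch.long, device=device)])
        draft_probs = torch.cat([new_probs[shift:],
                                 torch.full((shift,), -1.0, device=device)])

    return accepted[prompt_ids.numel():prompt_ids.numel() + num_new_tokens]
```

The acceleration comes from the acceptance step: whenever several consecutive draft tokens pass the verification test, they are all committed after a single forward pass instead of one pass per token, while the denoising-style initialisation and refinement keep the Jacobi iterates stable enough for long accepted prefixes.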
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization (2025)
- Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
- Self Speculative Decoding for Diffusion Large Language Models (2025)
- JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation (2025)
- Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding (2025)
- ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View (2025)
- Dream 7B: Diffusion Large Language Models (2025)