Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
Abstract
A new diffusion forcing sampler accelerates token generation in recurrent-depth language models, delivering up to a 5x speedup without any tuning.
Language models with recurrent depth, also referred to as universal or looped transformers, are defined by their capacity to increase computation through the repetition of layers. Recent pretraining efforts have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages on reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than baseline autoregressive generation under the same time budget on modern hardware. Moreover, this sampler, based on principles from the diffusion literature, can be applied directly to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
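The abstract's description of the sampler suggests a simple diagonal schedule: every forward pass admits one new token position while all in-flight positions receive one more recurrent refinement, and a position is committed once it has received its full recurrence budget. The sketch below illustrates only that scheduling logic; the `ToyRecurrentDepthLM` interface (`embed`, `recur_step`, `decode`), the fixed `num_recurrences` budget, and seeding each new latent from the last committed token are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyRecurrentDepthLM:
    """Stand-in for a recurrent-depth model: prelude, recurrent core, coda (all hypothetical)."""
    vocab_size: int = 5

    def embed(self, token_id: int) -> float:
        # Hypothetical prelude: map a token id to an initial latent state.
        return float(token_id)

    def recur_step(self, latent: float, context: list) -> float:
        # Hypothetical recurrent core: one refinement of the latent, conditioned
        # (causally) on the latents of already-committed tokens to its left.
        return 0.5 * latent + 0.1 * sum(context[-2:])

    def decode(self, latent: float) -> int:
        # Hypothetical coda: project a refined latent back to a token id.
        return int(latent) % self.vocab_size


def diffusion_forcing_sample(model, prompt_ids, max_new_tokens=8, num_recurrences=4):
    """Diagonal schedule in the spirit of diffusion forcing.

    Every outer iteration (one forward pass) opens one new token position,
    while all still-open positions receive one more recurrence step; a
    position is committed once it has accumulated `num_recurrences` steps.
    """
    committed = list(prompt_ids)      # finalized tokens (left context)
    in_flight = []                    # [latent, steps_taken] per open position
    target_len = len(prompt_ids) + max_new_tokens

    for _ in range(max_new_tokens + num_recurrences - 1):
        # 1) Open a new position, seeded from the last committed token.
        if len(committed) + len(in_flight) < target_len:
            in_flight.append([model.embed(committed[-1]), 0])

        # 2) Refine every open latent by one recurrence step. On real hardware
        #    these refinements form one batched pass, hence the parallelism.
        context = [model.embed(t) for t in committed]
        for slot in in_flight:
            slot[0] = model.recur_step(slot[0], context)
            slot[1] += 1

        # 3) Commit the oldest latent(s) once they have enough recurrences.
        while in_flight and in_flight[0][1] >= num_recurrences:
            latent, _ = in_flight.pop(0)
            committed.append(model.decode(latent))

    return committed


if __name__ == "__main__":
    toy = ToyRecurrentDepthLM()
    print(diffusion_forcing_sample(toy, prompt_ids=[1, 2]))
```

In this toy schedule the per-position refinements inside one outer iteration would be a single batched pass through the recurrent core on real hardware, which is the source of the claimed speedup over running the full recurrence sequentially for each token.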
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Self Speculative Decoding for Diffusion Large Language Models (2025)
- Set Block Decoding is a Language Model Inference Accelerator (2025)
- Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
- Sequential Diffusion Language Models (2025)
- Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States (2025)
- Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models (2025)
- dParallel: Learnable Parallel Decoding for dLLMs (2025)
