Abstract
dParallel is a method that enhances the parallel decoding of diffusion large language models, significantly reducing decoding steps without compromising performance.
Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet their parallel decoding potential remains largely underexplored: existing open-source models still require a number of decoding steps close to the generation length to maintain performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence of masked tokens. Building on this insight, we propose the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to reach high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method dramatically reduces the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup while maintaining accuracy. Our code is available at https://github.com/czg1225/dParallel.
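To make the certainty bottleneck concrete, here is a minimal decoding sketch, not the authors' implementation: a confidence-thresholded parallel sampler for a masked diffusion LM that, at each step, commits every masked position whose prediction certainty clears a threshold. The `model` interface, `mask_id`, and threshold `tau` are illustrative assumptions; dParallel's actual contribution is the certainty-forcing distillation that trains the model so that many tokens cross such a threshold early and in parallel, rather than one at a time.

```python
import torch

@torch.no_grad()
def parallel_decode(model, prompt_ids, gen_len=256, tau=0.9, mask_id=126336):
    """Greedy, confidence-thresholded parallel decoding sketch (illustrative only)."""
    device = prompt_ids.device
    # Prompt followed by a fully masked response block.
    x = torch.cat(
        [prompt_ids,
         torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)],
        dim=1,
    )
    steps = 0
    while (x == mask_id).any():
        logits = model(x).logits                    # (1, seq_len, vocab_size)
        probs = torch.softmax(logits.float(), dim=-1)
        conf, pred = probs.max(dim=-1)              # per-position certainty and argmax token
        masked = x == mask_id
        # Commit every masked position whose certainty clears the threshold;
        # otherwise fall back to the single most certain masked token so the loop progresses.
        accept = masked & (conf >= tau)
        if not accept.any():
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            accept = torch.zeros_like(masked)
            accept.view(-1)[best] = True
        x = torch.where(accept, pred, x)
        steps += 1
    return x[:, prompt_ids.shape[1]:], steps
```

The number of loop iterations (`steps`) is what dParallel drives down: a well-distilled model lets many masked positions exceed `tau` in the same forward pass.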
Community
We present dParallel, a novel method that unlocks the inherent parallelism of dLLMs for fast sampling. Our paper, code, models, and dataset are all available now!
Code: https://github.com/czg1225/dParallel
Paper: https://arxiv.org/pdf/2509.26488
Model: https://huggingface.co/Zigeng/dParallel-LLaDA-8B-instruct
Data: https://huggingface.co/datasets/Zigeng/dParallel_LLaDA_Distill_Data
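For reference, a hedged loading sketch: it assumes the released checkpoint follows the usual LLaDA-style remote-code path on the Hugging Face Hub (AutoTokenizer/AutoModel with `trust_remote_code`) and that the distillation data loads with the standard `datasets` API.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Assumption: the checkpoint exposes its architecture via remote code, as LLaDA does.
model_id = "Zigeng/dParallel-LLaDA-8B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()

# Trajectory data used for certainty-forcing distillation.
distill_data = load_dataset("Zigeng/dParallel_LLaDA_Distill_Data")
```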
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing (2025)
- Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding (2025)
- Sequential Diffusion Language Models (2025)
- Diffusion Language Models Know the Answer Before Decoding (2025)
- LLaDA-MoE: A Sparse MoE Diffusion Language Model (2025)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction (2025)
- FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction (2025)