Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Abstract
Recent advances in reasoning models have demonstrated significant improvements in accuracy, particularly for complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask and process them within a single sequence, which avoids additional memory usage. Experimental results show that our method achieves over 100% speedup in decoding time while maintaining answer quality.
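To make the idea of a "specialized attention mask" concrete, below is a minimal sketch (not the authors' code, and based only on the abstract) of one way such a mask could be built: parallel reasoning branches are packed into a single sequence, each branch token attends to the shared prefix and to earlier tokens of its own branch, but not to tokens of sibling branches. All function and variable names are illustrative assumptions.

```python
import torch

def build_parallel_branch_mask(prefix_len: int, branch_lens: list[int]) -> torch.Tensor:
    """Return a boolean mask of shape (total, total); True = attention allowed.

    Hypothetical helper: shows the block structure of an attention mask that
    keeps parallel branches independent while sharing one KV cache/sequence.
    """
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Shared prefix: ordinary causal attention.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool)
    )

    start = prefix_len
    for blen in branch_lens:
        end = start + blen
        # Branch tokens see the whole shared prefix...
        mask[start:end, :prefix_len] = True
        # ...and earlier tokens of their own branch (causal within the branch),
        # but nothing from sibling branches.
        mask[start:end, start:end] = torch.tril(
            torch.ones(blen, blen, dtype=torch.bool)
        )
        start = end
    return mask

if __name__ == "__main__":
    # A 4-token shared prefix followed by two parallel branches of 3 tokens each.
    print(build_parallel_branch_mask(prefix_len=4, branch_lens=[3, 3]).int())
```

With such a mask, one forward pass can advance every branch by one token (one new token per branch per decoding step), which is the source of the reported speedup on parallelizable steps.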
Community
A parallel decoding method for LLMs that can greatly accelerate the generation of parallelizable steps and requires almost no additional memory.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Speculative Decoding for Multi-Sample Inference (2025)
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs (2025)
- Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity (2025)
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting (2025)
- APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs (2025)
- LightThinker: Thinking Step-by-Step Compression (2025)
- Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding (2025)