SparseD: Sparse Attention for Diffusion Language Models
Abstract
SparseD is a sparse attention method for diffusion language models that addresses their high inference latency by pre-computing head-specific sparse patterns once and switching from full to sparse attention in the later denoising steps.
While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to attention's quadratic complexity with respect to context length, since all query-key pairs must be computed. A natural strategy to reduce this complexity is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging these observations, SparseD pre-computes head-specific sparse patterns only once and reuses them across all steps, avoiding the cost of recomputing sparse patterns at every denoising step. Meanwhile, SparseD uses full attention in the early steps and switches to sparse attention later to maintain generation quality. Together, these design choices establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to 1.50× speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
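The following is a minimal PyTorch sketch of this idea, not the released implementation: `project_qkv` and `denoise_step` are hypothetical placeholders for a model's attention projections and denoising update, `keep_ratio` and `full_steps` are assumed hyperparameters, and dense boolean masking is used only to illustrate the head-specific patterns (real speedups require a sparse attention kernel such as the one in the authors' repository).

```python
# Sketch only: full attention for the first `full_steps` denoising steps,
# then a head-specific sparse pattern computed once and reused afterwards.
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq, dim]; mask: [batch, heads, seq, seq] (bool)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def head_specific_mask(q, k, keep_ratio=0.1):
    # Keep, per head, the top-scoring query-key pairs (illustrative criterion).
    with torch.no_grad():
        scores = q @ k.transpose(-2, -1)                      # [B, H, S, S]
        k_keep = max(1, int(keep_ratio * scores.shape[-1]))
        idx = scores.topk(k_keep, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask

def denoise(model, x, num_steps=64, full_steps=8, keep_ratio=0.1):
    """Full attention early; sparse pattern computed once, then reused."""
    mask = None
    for step in range(num_steps):
        q, k, v = model.project_qkv(x)              # hypothetical helper
        if step < full_steps:
            out = attention(q, k, v)                # early steps: full attention
            if step == full_steps - 1:
                mask = head_specific_mask(q, k, keep_ratio)  # computed one time
        else:
            out = attention(q, k, v, mask)          # later steps: reuse pattern
        x = model.denoise_step(x, out, step)        # hypothetical helper
    return x
```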
Community
arXiv: https://arxiv.org/abs/2509.24014
Code: https://github.com/INV-WZQ/SparseD
SparseD is a novel sparse attention method for diffusion language models (DLMs), delivering near-lossless acceleration.
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- Bidirectional Sparse Attention for Faster Video Diffusion Training (2025)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction (2025)
- Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation (2025)
- DLLMQuant: Quantizing Diffusion-based Large Language Models (2025)
- ProxyAttn: Guided Sparse Attention via Representative Heads (2025)
- SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention (2025)
- DPad: Efficient Diffusion Language Models with Suffix Dropout (2025)