Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Abstract
Seed Diffusion Preview, a discrete-state diffusion language model, achieves fast inference through parallel generation, running significantly faster than Mercury and Gemini Diffusion while maintaining competitive quality on standard code benchmarks.
We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup that mitigates the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 tokens/s on H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks. It is significantly faster than the contemporary Mercury and Gemini Diffusion models, establishing a new state of the art on the speed-quality Pareto frontier for code models.
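The speedup described in the abstract comes from denoising many positions per forward pass instead of emitting one token per pass. The sketch below illustrates the general idea with a MaskGIT-style confidence-based unmasking loop; this is a minimal sketch of generic masked-diffusion decoding, not Seed Diffusion's actual algorithm, and `denoiser`, `MASK_ID`, and the unmasking schedule are illustrative assumptions.

```python
import torch

MASK_ID = 0    # hypothetical id of the [MASK] token
SEQ_LEN = 64   # length of the sequence to generate
NUM_STEPS = 8  # denoising steps; fewer steps => more tokens filled per step

def parallel_decode(denoiser, prompt_ids, seq_len=SEQ_LEN, num_steps=NUM_STEPS):
    """Decode a whole sequence in num_steps parallel denoising passes.

    `denoiser(prompt_ids, x)` is an assumed model call returning logits
    of shape (1, seq_len, vocab) for every position at once.
    """
    # Start from an all-masked canvas instead of decoding left to right.
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        logits = denoiser(prompt_ids, x)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)          # per-position confidence and argmax
        still_masked = x == MASK_ID
        # Unmask the most confident positions this step; all positions are
        # predicted in parallel, which is where the speedup comes from.
        k = max(1, int(still_masked.sum()) // (num_steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)
        top = conf.topk(k, dim=-1).indices
        x[0, top[0]] = pred[0, top[0]]
    return x
```

A schedule like this recovers the full sequence in `num_steps` forward passes rather than `seq_len`, which is the source of the large throughput gap over token-by-token autoregressive decoding quoted above.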
Community
That's so coooooool
On the main graph, what would the tokens per second of Seed Coder Instruct be on the same hardware?
It's confusing that there isn't a clear and direct throughput comparison between this model and an autoregressive one.
Hi, thanks for your interest. We just ran the evaluation of seed-coder-instruct under our deployment settings; the speed is 344 tokens/s, roughly 6x slower than Seed Diffusion Preview's 2,146 tokens/s. And good suggestion! We'll consider updating the main figure :)
nice work
This looks amazing! Any plans to open source this?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Mercury: Ultra-Fast Language Models Based on Diffusion (2025)
- Discrete Diffusion in Large Language and Multimodal Models: A Survey (2025)
- Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles (2025)
- DIFFA: Large Language Diffusion Models Can Listen and Understand (2025)
- Discrete Diffusion Models for Language Generation (2025)
- Plan for Speed - Dilated Scheduling for Masked Diffusion Language Models (2025)
- Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Pretty cool! I recently added support for LLaDA and Dream models in llama.cpp, would love to add support if you ever plan to open source the inference code!
Wow, excellent work! May I ask for the links to the LLaDA and Dream support in llama.cpp?
Very cool getting faster results than Gemini! Would love to play around with this if you open source it!