SDLM-32B-D4

[📂 GitHub] [📜 Tech Report] [🤗 HuggingFace]

Introduction

We propose the Sequential Diffusion Language Model (SDLM) to cheaply elicit the parallel prediction capabilities of diffusion models. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length, and it enforces the decoding order through a longest-prefix decoding method, significantly improving prediction efficiency while preserving generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be migrated to the diffusion framework with only minimal instruction fine-tuning.
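
As a rough picture of the decoding loop this implies, here is a minimal sketch under stated assumptions: `draft_block` is a hypothetical helper, not part of the released API, and EOS handling is omitted.

    # Sketch of SDLM-style blockwise decoding: the model drafts a fixed-size
    # block, and only the longest confident prefix of it is committed.
    def generate(model, tokens, block_size=4, tau=0.5, max_len=1024):
        while len(tokens) < max_len:
            # Hypothetical call: returns `block_size` candidate tokens and a
            # confidence score for each (e.g. the argmax probability).
            block, conf = model.draft_block(tokens, block_size)
            keep = 1  # assumed fallback: always commit at least one token
            while keep < block_size and conf[keep] >= tau:
                keep += 1
            tokens = tokens + block[:keep]  # longest high-confidence prefix
        return tokens

In the worst case only one token is committed per forward pass, which recovers plain autoregressive decoding; in the best case all B tokens are accepted at once.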


SDLM Family

The following table provides an overview of the SDLM series; SDLM-32B-D4 (this model) is fine-tuned from Qwen/Qwen2.5-32B.

Model          Block Size
SDLM-3B-D4     4
SDLM-3B-D8     8
SDLM-32B-D4    4

Model Architecture

We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.

  • (a) Training pipeline. Reordered input enables a structured mask with a causal prefix (top-left), a visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right); see the mask sketch after this list.
  • (b) Sampling pipeline. Confidence-based dynamic block decoding with KV-cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks, and the longest high-confidence prefix is selected as the dynamic output. Cached KV states enable efficient decoding.

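To make the mask structure concrete, here is a minimal sketch of how such a block attention mask could be assembled in PyTorch; the function name and the True-means-attend boolean convention are illustrative assumptions, not the released implementation.

    import torch

    def sdlm_block_attention_mask(prefix_len: int, block_size: int) -> torch.Tensor:
        # Hypothetical helper: True marks positions a query may attend to.
        total = prefix_len + block_size
        # Causal attention everywhere: the prefix stays strictly autoregressive,
        # and block tokens see the entire (cross-block) prefix to their left.
        mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
        # Intra-block bidirectional attention: tokens inside the current block
        # attend to each other regardless of position.
        mask[prefix_len:, prefix_len:] = True
        return mask

    # Example: a 4-token causal prefix followed by a block of B = 4 positions.
    print(sdlm_block_attention_mask(prefix_len=4, block_size=4).int())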

Performance

Long-Form Benchmarks

SDLM delivers strong performance with significantly faster decoding: it runs approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to a 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark.


General Multiple-Choice Benchmarks


Block Size & Self-Speculative Decoding


Trade-off Between Performance and Speed

Trade-off between performance and speed under different confidence thresholds τ for SDLM-3B (B=4) and SDLM-3B (B=8). By adjusting τ, a controllable trade-off between speed and performance can be achieved. SpeedUp denotes the average number of tokens output per forward pass.

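The acceptance rule behind this trade-off can be written compactly; the sketch below is a vectorized variant of the loop from the introduction, with `longest_confident_prefix` and the commit-at-least-one fallback as illustrative assumptions.

    import torch

    def longest_confident_prefix(confidences: torch.Tensor, tau: float) -> int:
        # `confidences` holds one score per drafted token in the block.
        passed = (confidences >= tau).int()
        # cumprod zeroes everything after the first low-confidence token, so
        # the sum is the length of the longest all-confident prefix.
        prefix_len = int(torch.cumprod(passed, dim=0).sum().item())
        # Assumed fallback: commit at least one token per forward pass, making
        # plain autoregressive decoding the worst case (SpeedUp = 1).
        return max(prefix_len, 1)

    # Example: with tau = 0.5, only the first two tokens are committed.
    print(longest_confident_prefix(torch.tensor([0.9, 0.7, 0.3, 0.8]), tau=0.5))

Raising τ shortens the accepted prefixes (lower SpeedUp, higher quality); lowering it does the opposite.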

Inference

  1. Install Dependencies

    Key package versions:

    transformers==4.37.2
    torch>=2.5.0
    
  2. Download the model generation script sdlm_inference.py to your working directory.
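
    If the script is distributed in the model repository (an assumption, not confirmed by this card), it can be fetched with huggingface_hub:

    from huggingface_hub import hf_hub_download

    # Hypothetical: assumes sdlm_inference.py sits at the root of the model repo.
    hf_hub_download(
        repo_id="OpenGVLab/SDLM-32B-D4",
        filename="sdlm_inference.py",
        local_dir="."
    )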

  3. We provide example code for running SDLM-32B-D4 with transformers.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sdlm_inference import SDLM_generate
    
    if __name__ == "__main__":
        ckpt_hf = 'OpenGVLab/SDLM-32B-D4'
    
        # trust_remote_code is required because SDLM ships custom modeling code.
        model = AutoModelForCausalLM.from_pretrained(
            ckpt_hf,
            attn_implementation="eager",
            trust_remote_code=True
        ).to(dtype=torch.float16)
        tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)
    
        prompt = 'Write a Fibonacci function in Python.'
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
    
        model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
        response, history = SDLM_generate(
            model,
            tokenizer,
            model_inputs,
            max_gen_len=1024,
            temperature=0,         # greedy decoding
            threshold=0.5,         # confidence threshold tau
            n_future_tokens=4,     # block size B (4 for the D4 checkpoint)
            alg='prob_conf',       # prob_conf | entropy_conf | self_speculative
            save_history=True,
            use_cache=True
        )
    
        print('response: ', response[0])
    
        print('======= history')
        for item in history:
            # item[1]: running token count; item[0][0]: decoded text at this step
            print('cur total tokens ', item[1])
            print(item[0][0])
            print('--------')
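
    Here `threshold` plays the role of the confidence threshold τ from the trade-off study above: lower values commit longer prefixes per forward pass at some cost in quality. `n_future_tokens` should match the block size the checkpoint was trained with, which is 4 for this D4 model.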
    

Citation

If you find this project useful in your research, please consider citing:

@article{liu2025sdlm,
  title={Sequential Diffusion Language Models},
  author={Liu, Yangzhou and Cao, Yue and Li, Hao and Luo, Gen and Chen, Zhe and Wang, Weiyun and Liang, Xiaobo and Qi, Biqing and Wu, Lijun and Tian, Changyao and Zhang, Yanting and Li, Yuqiang and Lu, Tong and Qiao, Yu and Dai, Jifeng and Wang, Wenhai},
  journal={arXiv preprint arXiv:2509.24007},
  year={2025}
}