SDLM-32B-D4
[📂 GitHub] [📜 Tech Report] [🚀 Project Page] [🤗 HuggingFace Collection]
Introduction
We propose the Sequential Diffusion Language Model (SDLM), which cheaply elicits the parallel prediction capabilities of diffusion models. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length, and enforces the decoding order through a longest-prefix decoding method, thereby significantly improving prediction efficiency while preserving generation quality. Our method can be viewed as a generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be migrated to the diffusion framework with only minimal instruction fine-tuning.
Overall Concept
- Autoregression: Predicts tokens one by one.
- Diffusion: Regenerates all tokens each step.
- SDLM (ours): Decodes D tokens per step, then keeps the longest consecutive n confident tokens (1 ≤ n ≤ D). Cached tokens are reused, saving computation.
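The selection rule in the last bullet is easy to sketch. Below is a minimal illustration, not the released implementation: the function name `longest_confident_prefix` and the use of the max softmax probability as the confidence score are our assumptions for exposition.

```python
import torch

def longest_confident_prefix(logits: torch.Tensor, threshold: float = 0.5):
    """Keep the longest prefix of a D-token block whose per-token confidence
    stays above `threshold`; at least one token is always accepted, so
    decoding never falls below autoregressive speed (1 <= n <= D)."""
    probs = logits.softmax(dim=-1)      # (D, vocab_size)
    conf, tokens = probs.max(dim=-1)    # per-position confidence and argmax token
    n = 1                               # the first token is always accepted
    while n < conf.numel() and conf[n] >= threshold:
        n += 1
    return tokens[:n], n
```

Raising the threshold accepts fewer tokens per step (slower, safer); lowering it accepts more (faster). This is the speed/quality knob discussed in the trade-off section below.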
SDLM Family
In the following table, we provide an overview of the SDLM series.
Model Name | Base Model 🤗 | HF Link 🤗 |
---|---|---|
SDLM-3B-D4 | Qwen2.5-3B | https://huggingface.co/OpenGVLab/SDLM-3B-D4 |
SDLM-3B-D8 | Qwen2.5-3B | https://huggingface.co/OpenGVLab/SDLM-3B-D8 |
SDLM-32B-D4 | Qwen2.5-32B | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |
Model Architecture
We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.
- (a) Training pipeline. Reordered input enables structured mask with causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right).
- (b) Sampling Pipeline. Confidence-based dynamic block decoding with KV cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks. The longest high-confidence prefix is selected as dynamic output. Cached KV states enable efficient decoding.
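The structured mask in (a) can be sketched as a boolean attention mask. This is a minimal sketch under our own naming (`block_diffusion_mask` is hypothetical, not from the released code); `True` means the query position may attend to the key position.

```python
import torch

def block_diffusion_mask(prefix_len: int, num_blocks: int, block_size: int) -> torch.Tensor:
    """Causal attention over the prefix (top-left), full visibility of the
    prefix and earlier blocks (bottom-left), and bidirectional attention
    inside each block (bottom-right)."""
    total = prefix_len + num_blocks * block_size
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal prefix: prefix token i attends to prefix tokens 0..i.
    mask[:prefix_len, :prefix_len] = torch.ones(prefix_len, prefix_len, dtype=torch.bool).tril()
    for b in range(num_blocks):
        start = prefix_len + b * block_size
        end = start + block_size
        mask[start:end, :start] = True     # visible cross-block prefix
        mask[start:end, start:end] = True  # intra-block bidirectional attention
    return mask
```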
Performance
Long-Form Benchmarks
SDLM delivers strong performance with significantly faster decoding speed. It operates approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark.
General Multiple-Choice Benchmarks
Block Size & Self-Speculative Decoding
Trade-off Between Performance and Speed
Trade-off between performance and speed under different confidence thresholds τ for SDLM-3B (B=4) and SDLM-3B (B=8). By adjusting τ, a controllable trade-off between speed and performance can be achieved. SpeedUp denotes the average number of tokens output per forward pass.
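A note on how SpeedUp can be measured: with `save_history=True`, `SDLM_generate` (see the inference example below) returns a `history` whose entries carry the cumulative token count at each step. Assuming one history entry per forward pass, which is our reading of the example output rather than a documented guarantee, a rough estimate is:

```python
def average_tokens_per_step(history) -> float:
    """Rough SpeedUp estimate: total generated tokens / number of forward passes.
    Assumes item[1] holds the cumulative token count and each entry is one
    forward pass; this is our interpretation, not an official utility."""
    total_tokens = history[-1][1]  # cumulative count at the final step
    return total_tokens / len(history)
```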
Inference
Install Dependencies
Key package versions:
```
transformers==4.37.2
torch>=2.5.0
```
Download the model generation script sdlm_inference.py to your working directory.
We provide example code to run SDLM-32B-D4 using `transformers`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from sdlm_inference import SDLM_generate

if __name__ == "__main__":
    ckpt_hf = 'OpenGVLab/SDLM-32B-D4'

    model = AutoModelForCausalLM.from_pretrained(
        ckpt_hf,
        attn_implementation="eager",
        trust_remote_code=True
    ).to(dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)

    prompt = 'Write a Fibonacci function in Python.'
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    response, history = SDLM_generate(
        model,
        tokenizer,
        model_inputs,
        max_gen_len=1024,
        temperature=0,
        threshold=0.5,
        n_future_tokens=4,
        alg='prob_conf',  # prob_conf | entropy_conf | self_speculative
        save_history=True,
        use_cache=True
    )

    print('response: ', response[0])
    print('======= history')
    for item in history:
        print('cur total tokens ', item[1])
        print(item[0][0])
        print('--------')
```
Train
Environment Setup
```
git clone https://github.com/OpenGVLab/SDLM.git
cd SDLM
```
Install Dependencies
Key package versions:
```
transformers==4.37.2
deepspeed==0.16.5
torch>=2.5.0
accelerate==0.32.1
```
Note: Additional setup is required if using Flex Attention.
Prepare Training Data
The training dataset we used is specified in the meta file meta.json and is organized in the ShareGPT style, following the InternVL chat data format.
This dataset is composed of several open-source datasets, with the following structure:
| Dataset Name | # Samples | Domain |
| --- | --- | --- |
| ScaleQuest-Math | 1,000K | Math |
| Opc-sft-stage2 | 436K | Code |
| Smoltalk | 1,100K | General |
| Tulu-3-sft-mixture | 939K | General |
| SciRIFF | 79K | Science |
| Table-GPT | 13K | Table |
| Total | 3,506K | -- |
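Each record follows the ShareGPT conversation style. A minimal illustrative sample is shown below; the content is made up, and the authoritative schema is the InternVL chat data format referenced above.

```python
# Illustrative ShareGPT-style record (hypothetical content).
sample = {
    "id": 0,
    "conversations": [
        {"from": "human", "value": "Write a Fibonacci function in Python."},
        {"from": "gpt", "value": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"},
    ],
}
```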
Start Training

All training scripts are available in the shell/train directory. Key parameters include:

- `block_size`: The size of the diffusion window. Current settings use `4`; we have also experimented with `8`, and larger sizes are under exploration.
- `attn_implementation`: Attention implementation type. Options include `sdpa`, `eager`, and `flex_attn`. Using Flex Attention requires additional setup; prefer `sdpa` for a quick start.
- `causal_attn`: Whether to use causal attention within the window. Currently set to non-causal (`False`); see the sketch after this list.
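As a rough illustration of what `causal_attn` toggles, the intra-block attention pattern changes as follows. This is a sketch under our assumptions, not the released training code.

```python
import torch

def intra_block_mask(block_size: int, causal_attn: bool = False) -> torch.Tensor:
    """causal_attn=False (the current setting): every token in the window
    attends to every other token. causal_attn=True: each token attends only
    to earlier positions, as in standard AR attention."""
    full = torch.ones(block_size, block_size, dtype=torch.bool)
    return full.tril() if causal_attn else full
```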
For more details about training, please refer to the GitHub repository.
Evaluation
Currently, we use OpenCompass for evaluation. For more details, please refer to the evaluation guide.
Acknowledgements
We extend our gratitude to the open-source community for their foundational contributions:
- InternVL, the codebase we build upon.
- SMDM, LLaDA, Dream, Block Diffusion for insights into diffusion-based generative modeling.
- Qwen2.5 as a robust base model for comparative studies.
- OpenCompass for providing a comprehensive evaluation framework.
- The creators of all datasets used in this work, enabling rigorous training and validation.
Citation
If you find this project useful in your research, please consider citing:
@article{liu2025sdlm,
title={Sequential Diffusion Language Models},
author={Liu, Yangzhou and Cao, Yue and Li, Hao and Luo, Gen and Chen, Zhe and Wang, Weiyun and Liang, Xiaobo and Qi, Biqing and Wu, Lijun and Tian, Changyao and Zhang, Yanting and Li, Yuqiang and Lu, Tong and Qiao, Yu and Dai, Jifeng and Wang, Wenhai},
journal={arXiv preprint arXiv:2509.24007},
year={2025}
}