Streaming-dLLM is a training-free acceleration framework for diffusion language models, enabling efficient inference for backbones such as Dream, LLaDA, and LLaDA-1.5.
*Figure: Comparison of accuracy and throughput across different acceleration strategies. Our proposed method improves inference throughput while maintaining competitive accuracy compared to prior approaches.*
*Figure: Illustration of approximated suffix pruning. For each block, the nearest neighboring region following the current block is retained using a sliding window (red dashed box) and concatenated with the trailing positions to form an approximate suffix region.*
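To make the idea concrete, here is a minimal sketch of how such an approximate suffix region could be assembled. The function and parameter names (`approx_suffix_indices`, `window_size`, `tail_size`) are illustrative assumptions, not this repo's actual API:

```python
# Illustrative sketch of approximated suffix pruning (assumed names; not the
# repo's actual API). For the current block, keep only a sliding window of
# positions right after the block plus a few trailing positions, instead of
# attending to the entire masked suffix.
import torch

def approx_suffix_indices(block_end: int, seq_len: int,
                          window_size: int = 32, tail_size: int = 4) -> torch.Tensor:
    # Sliding window: the nearest neighboring region after the current block.
    window = torch.arange(block_end, min(block_end + window_size, seq_len))
    # Trailing positions at the end of the sequence (clipped to avoid overlap).
    tail_start = min(max(seq_len - tail_size, block_end + window_size), seq_len)
    tail = torch.arange(tail_start, seq_len)
    # Concatenate to form the approximate suffix region.
    return torch.cat([window, tail])

# Example: a block ending at position 64 in a 512-token canvas keeps
# positions 64-95 (window) and 508-511 (tail).
print(approx_suffix_indices(64, 512))
```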
- Clone this repo.

$ git clone https://github.com/xiaoshideta/Streaming-dLLM.git
$ cd Streaming-dLLM

- Create the environment and install all dependencies.

$ conda create -n stream-dllm python=3.10
$ conda activate stream-dllm
$ pip install -r requirements.txt
Your project structure should look like this:

|-- &lt;Dream&gt;
|-- &lt;LLaDA-1.5&gt;
|-- &lt;Other&gt;

- Download the Dream model here.
- Download the LLaDA model here.
- Download the LLaDA-1.5 model here.
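If you prefer scripting the downloads, below is a minimal sketch using `huggingface_hub`'s `snapshot_download`. The repo IDs and target directories are assumptions; verify them against the official model pages linked above:

```python
# Minimal download sketch. The Hugging Face repo IDs and local paths are
# assumptions; verify them against the official model links above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Dream-org/Dream-v0-Instruct-7B", local_dir="models/Dream")
snapshot_download(repo_id="GSAI-ML/LLaDA-8B-Instruct", local_dir="models/LLaDA")
snapshot_download(repo_id="GSAI-ML/LLaDA-1.5", local_dir="models/LLaDA-1.5")
```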
- Evaluate Dream.

$ cd Dream
$ bash eval_dream.sh

- Evaluate LLaDA-1.5.

$ cd LLaDA-1.5
$ bash eval_llada.sh

- Evaluate LLaDA: first replace the LLaDA-1.5 model path in the script with the LLaDA path, then run the same script.
$ bash eval_llada.sh

Our method achieves a 3.7×–13.3× speedup over the vanilla backbone across all benchmarks. Compared with the state-of-the-art acceleration method (Fast-dLLM), it provides an additional 1.5×–2.3× speedup on tasks with a generation length of 512, while accuracy remains comparable or slightly better, demonstrating the effectiveness of our approach.
Results with Dream as the backbone. Each cell reports accuracy / throughput (speedup over the vanilla backbone).

| Benchmark | Gen Length | Dream | dKV-Cache | Prefix-Cache | Fast-dLLM | Ours |
|---|---|---|---|---|---|---|
| HumanEval (0-shot) | 256 | 49.4 / 20.4 (1×) | 48.2* / 21.5 (1.1×) | 53.7 / 32.0 (1.6×) | 54.3 / 53.7 (2.6×) | 54.3 / 74.7 (3.7×) |
| HumanEval (0-shot) | 512 | 54.3 / 13.7 (1×) | 49.4* / 15.7 (1.1×) | 54.9 / 24.2 (1.8×) | 54.3 / 40.2 (2.9×) | 54.6 / 72.3 (5.3×) |
| GSM8K-CoT (5-shot) | 256 | 74.8* / 9.0 (1×) | 73.6* / 17.0 (1.9×) | 74.0* / 31.5 (3.5×) | 73.5* / 47.9 (5.3×) | 74.0 / 75.5 (8.4×) |
| GSM8K-CoT (5-shot) | 512 | 74.2* / 7.1 (1×) | 71.6* / 12.8 (1.8×) | 74.2* / 23.6 (3.3×) | 74.1* / 41.7 (5.9×) | 74.7 / 94.1 (13.3×) |
| MBPP (3-shot) | 256 | 56.6 / 11.0 (1×) | 54.0* / 14.7 (1.3×) | 53.2 / 32.3 (2.9×) | 56.4 / 67.2 (6.1×) | 56.4 / 80.2 (7.3×) |
| MBPP (3-shot) | 512 | 55.6 / 8.7 (1×) | 53.0* / 11.6 (1.3×) | 53.8 / 24.5 (2.8×) | 55.2 / 63.1 (7.3×) | 55.8 / 92.4 (10.6×) |
| MATH (4-shot) | 256 | 38.4 / 10.5 (1×) | 36.8* / 14.6 (1.4×) | 36.8 / 32.5 (3.1×) | 37.6 / 62.6 (6.0×) | 37.6 / 78.4 (7.5×) |
| MATH (4-shot) | 512 | 39.8 / 8.6 (1×) | 38.5* / 11.6 (1.3×) | 38.0 / 24.5 (2.8×) | 39.3 / 54.4 (6.3×) | 39.4 / 96.0 (11.2×) |
Results with LLaDA-1.5 as the backbone (same cell format).

| Benchmark | Gen Length | LLaDA-1.5 | dKV-Cache | Prefix-Cache | Fast-dLLM | Ours |
|---|---|---|---|---|---|---|
| HumanEval (0-shot) | 256 | 43.9* / 6.4 (1×) | 40.2* / 6.6 (1.0×) | 38.4* / 10.9 (1.7×) | 37.2* / 19.1 (3.0×) | 39.0 / 34.1 (5.3×) |
| HumanEval (0-shot) | 512 | 40.5* / 2.9 (1×) | 40.2* / 3.3 (1.1×) | 37.8* / 4.8 (1.7×) | 39.8* / 13.6 (4.7×) | 40.2 / 26.7 (9.2×) |
| GSM8K (5-shot) | 256 | 80.5* / 6.3 (1×) | 80.7* / 10.8 (1.7×) | 80.6* / 24.4 (3.9×) | 80.7 / 50.0 (7.9×) | 80.8 / 66.2 (10.5×) |
| GSM8K (5-shot) | 512 | 81.0* / 2.5 (1×) | 81.3* / 4.2 (1.7×) | 81.0* / 8.2 (3.3×) | 80.4 / 25.8 (10.3×) | 81.2 / 69.8 (28.0×) |
| MBPP (3-shot) | 256 | 38.0* / 2.2 (1×) | 38.2* / 3.5 (1.6×) | 37.8* / 7.6 (3.5×) | 37.6* / 29.5 (13.4×) | 37.8 / 54.7 (24.9×) |
| MBPP (3-shot) | 512 | 38.2* / 0.9 (1×) | 38.1* / 1.5 (1.7×) | 38.0* / 2.8 (3.1×) | 38.1* / 16.5 (18.3×) | 38.4 / 61.4 (68.2×) |
| MATH (4-shot) | 256 | 32.7* / 7.8 (1×) | 31.8* / 12.4 (1.6×) | 32.5* / 25.9 (3.3×) | 32.6 / 47.1 (6.0×) | 33.7 / 66.2 (8.5×) |
| MATH (4-shot) | 512 | 37.1* / 4.8 (1×) | 35.1* / 7.5 (1.6×) | 35.0* / 13.9 (2.9×) | 35.1 / 38.3 (7.9×) | 35.1 / 62.4 (13.0×) |
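For reference, each speedup above is the throughput ratio against the vanilla backbone; e.g., for Dream on HumanEval at length 512, 72.3 / 13.7 ≈ 5.3×. A minimal timing sketch of such a measurement, where `generate_fn` is a hypothetical stand-in for any decoding call rather than this repo's API:

```python
# Minimal throughput-measurement sketch. `generate_fn` is a hypothetical
# stand-in for a decoding call; it is not part of this repo's API.
import time

def tokens_per_second(generate_fn, prompt, gen_length=512):
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens=gen_length)  # decode gen_length tokens
    return gen_length / (time.perf_counter() - start)

# Speedup is the ratio of throughputs, e.g.:
# speedup = tokens_per_second(accelerated_gen, p) / tokens_per_second(vanilla_gen, p)
```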
If you find this work useful, please cite our paper:
@misc{xiao2026streamingdllmacceleratingdiffusionllms,
title={Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding},
author={Zhongyu Xiao and Zhiwei Hao and Jianyuan Guo and Yong Luo and Jia Liu and Jie Xu and Han Hu},
year={2026},
eprint={2601.17917},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.17917},
}

Part of our code is based on Fast-dLLM, LLaDA, and Dream. Thanks for their excellent work!