
Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding


Introduction

Streaming-dLLM is a training-free acceleration framework for diffusion language models that supports efficient inference for models like Dream, LLaDA and LLaDA-1.5.

Comparison

Comparison of accuracy and throughput across different acceleration strategies. Our proposed method improves inference throughput while maintaining competitive accuracy compared to prior approaches.

Illustration

Illustration of approximated suffix pruning. For each block, the nearest neighboring region following the current block is retained using a sliding window (red dashed box) and concatenated with the trailing position to form an approximate suffix region.
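
A minimal sketch of this selection step, assuming block-wise decoding over a fixed-length sequence. The function name and the window_size / num_trailing parameters are illustrative, not the repository's actual API:

def approximate_suffix_indices(seq_len, block_end, window_size=32, num_trailing=1):
    """Positions kept as the approximate suffix for the block ending at block_end.

    Instead of attending to every masked position after the current block,
    keep a sliding window of window_size positions immediately following the
    block plus num_trailing positions at the very end of the sequence.
    """
    window = list(range(block_end, min(block_end + window_size, seq_len)))
    trailing = [i for i in range(max(seq_len - num_trailing, block_end), seq_len)
                if i not in window]
    return window + trailing

# Example: in a 512-token sequence whose current block ends at position 160,
# positions 160-191 (sliding window) plus the trailing position 511 are kept.
print(approximate_suffix_indices(512, 160))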

Installation

  1. Clone this repo.
$ git clone https://github.com/xiaoshideta/Streaming-dLLM.git
$ cd Streaming-dLLM
  2. Create the environment and install all dependencies.
$ conda create -n stream-dllm python=3.10.19
$ conda activate stream-dllm
$ pip install -r requirements.txt

Project Structure

Your project structure should look like this:

|-- <Dream>
|-- <LLaDA-1.5>
|-- <Other>

Usage

Model Weights

Download the Dream model here.

Download the LLaDA model here.

Download the LLaDA-1.5 model here.
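
If the weights are hosted on Hugging Face, a minimal download sketch using huggingface_hub; the repo ids below are placeholders for the links above, not actual identifiers:

from huggingface_hub import snapshot_download

# Replace each "<...-repo-id>" placeholder with the repository id behind the
# corresponding link above.
snapshot_download(repo_id="<dream-repo-id>", local_dir="./models/Dream")
snapshot_download(repo_id="<llada-repo-id>", local_dir="./models/LLaDA")
snapshot_download(repo_id="<llada-1.5-repo-id>", local_dir="./models/LLaDA-1.5")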

Dream

cd Dream
bash eval_dream.sh

LLaDA-1.5

cd LLaDA-1.5
bash eval_llada.sh

LLaDA

First replace the LLaDA-1.5 model path in the script with the LLaDA path, then run the same script.

bash eval_llada.sh

Performance

Our method achieves a 3.7×–13.3× speedup over the vanilla backbone across all benchmarks. Compared with the state-of-the-art acceleration method, it provides an additional 1.5×–2.3× speedup on tasks with a generation length of 512, while accuracy remains comparable or slightly better, demonstrating the effectiveness of our approach. Detailed results are listed below; a worked speedup example follows the tables.

Each cell reports accuracy / throughput, with the speedup over the corresponding vanilla backbone in parentheses.

Dream backbone:

| Benchmark | Gen Length | Dream | dKV-Cache | Prefix-Cache | Fast-dLLM | Ours |
|---|---|---|---|---|---|---|
| HumanEval (0-shot) | 256 | 49.4 / 20.4 (1×) | 48.2* / 21.5 (1.1×) | 53.7 / 32.0 (1.6×) | 54.3 / 53.7 (2.6×) | 54.3 / 74.7 (3.7×) |
| HumanEval (0-shot) | 512 | 54.3 / 13.7 (1×) | 49.4* / 15.7 (1.1×) | 54.9 / 24.2 (1.8×) | 54.3 / 40.2 (2.9×) | 54.6 / 72.3 (5.3×) |
| GSM8K-CoT (5-shot) | 256 | 74.8* / 9.0 (1×) | 73.6* / 17.0 (1.9×) | 74.0* / 31.5 (3.5×) | 73.5* / 47.9 (5.3×) | 74.0 / 75.5 (8.4×) |
| GSM8K-CoT (5-shot) | 512 | 74.2* / 7.1 (1×) | 71.6* / 12.8 (1.8×) | 74.2* / 23.6 (3.3×) | 74.1* / 41.7 (5.9×) | 74.7 / 94.1 (13.3×) |
| MBPP (3-shot) | 256 | 56.6 / 11.0 (1×) | 54.0* / 14.7 (1.3×) | 53.2 / 32.3 (2.9×) | 56.4 / 67.2 (6.1×) | 56.4 / 80.2 (7.3×) |
| MBPP (3-shot) | 512 | 55.6 / 8.7 (1×) | 53.0* / 11.6 (1.3×) | 53.8 / 24.5 (2.8×) | 55.2 / 63.1 (7.3×) | 55.8 / 92.4 (10.6×) |
| MATH (4-shot) | 256 | 38.4 / 10.5 (1×) | 36.8* / 14.6 (1.4×) | 36.8 / 32.5 (3.1×) | 37.6 / 62.6 (6.0×) | 37.6 / 78.4 (7.5×) |
| MATH (4-shot) | 512 | 39.8 / 8.6 (1×) | 38.5* / 11.6 (1.3×) | 38.0 / 24.5 (2.8×) | 39.3 / 54.4 (6.3×) | 39.4 / 96.0 (11.2×) |

LLaDA-1.5 backbone:

| Benchmark | Gen Length | LLaDA-1.5 | dKV-Cache | Prefix-Cache | Fast-dLLM | Ours |
|---|---|---|---|---|---|---|
| HumanEval (0-shot) | 256 | 43.9* / 6.4 (1×) | 40.2* / 6.6 (1.0×) | 38.4* / 10.9 (1.7×) | 37.2* / 19.1 (3.0×) | 39.0 / 34.1 (5.3×) |
| HumanEval (0-shot) | 512 | 40.5* / 2.9 (1×) | 40.2* / 3.3 (1.1×) | 37.8* / 4.8 (1.7×) | 39.8* / 13.6 (4.7×) | 40.2 / 26.7 (9.2×) |
| GSM8K (5-shot) | 256 | 80.5* / 6.3 (1×) | 80.7* / 10.8 (1.7×) | 80.6* / 24.4 (3.9×) | 80.7 / 50.0 (7.9×) | 80.8 / 66.2 (10.5×) |
| GSM8K (5-shot) | 512 | 81.0* / 2.5 (1×) | 81.3* / 4.2 (1.7×) | 81.0* / 8.2 (3.3×) | 80.4 / 25.8 (10.3×) | 81.2 / 69.8 (28.0×) |
| MBPP (3-shot) | 256 | 38.0* / 2.2 (1×) | 38.2* / 3.5 (1.6×) | 37.8* / 7.6 (3.5×) | 37.6* / 29.5 (13.4×) | 37.8 / 54.7 (24.9×) |
| MBPP (3-shot) | 512 | 38.2* / 0.9 (1×) | 38.1* / 1.5 (1.7×) | 38.0* / 2.8 (3.1×) | 38.1* / 16.5 (18.3×) | 38.4 / 61.4 (68.2×) |
| MATH (4-shot) | 256 | 32.7* / 7.8 (1×) | 31.8* / 12.4 (1.6×) | 32.5* / 25.9 (3.3×) | 32.6 / 47.1 (6.0×) | 33.7 / 66.2 (8.5×) |
| MATH (4-shot) | 512 | 37.1* / 4.8 (1×) | 35.1* / 7.5 (1.6×) | 35.0* / 13.9 (2.9×) | 35.1 / 38.3 (7.9×) | 35.1 / 62.4 (13.0×) |
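
The parenthesized factors above are throughput ratios against the corresponding vanilla backbone. A quick sanity check on the Dream / HumanEval (0-shot, length 256) row, assuming both reported throughputs share the same unit:

baseline_throughput = 20.4  # vanilla Dream
ours_throughput = 74.7      # "Ours" column
print(f"{ours_throughput / baseline_throughput:.1f}x")  # prints 3.7x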

Citation

If you find this work useful, please cite our paper:

@misc{xiao2026streamingdllmacceleratingdiffusionllms,
      title={Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding}, 
      author={Zhongyu Xiao and Zhiwei Hao and Jianyuan Guo and Yong Luo and Jia Liu and Jie Xu and Han Hu},
      year={2026},
      eprint={2601.17917},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.17917}, 
}

Acknowledgements

Part of our code is based on Fast-dLLM, LLaDA, and Dream; thanks for their excellent work!
