Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache).
- [2025/06/15] Our dLLM-Cache is compatible with MMaDA.
- [2025/05/31] Our dLLM-Cache is integrated into LLaDA-V.
- [2025/05/23] The code of our paper has been released.
- [2025/05/17] Our paper has been released.
- Currently supported models: LLaDA, Dream, LLaDA-V and MMaDA.
- Speedup: Achieves up to 9.1x speedup over standard dLLM pipelines, with no performance loss on most tasks.
- Evaluation: Evaluated on LLaDA 8B and Dream 7B.
- Latency: Approaches the inference speed of autoregressive models (ARMs) in many scenarios.
Here's an overview of the process behind our dLLM-Cache method:
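At a high level, dLLM-Cache reuses intermediate transformer features across adjacent denoising steps instead of recomputing everything from scratch at every step. As a rough, unofficial intuition for that pattern (not the code in this repository), the sketch below refreshes prompt features only every few steps and gives response features a partial, similarity-guided update each step. The names `expensive_block`, `prompt_interval`, and `refresh_ratio` are hypothetical and chosen for illustration; see the paper for the actual adaptive update policy.

```python
# Minimal, self-contained sketch of feature caching across denoising steps.
# NOT the repository's implementation; all names and numbers are illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 64
W = torch.randn(D, D) / D ** 0.5  # stand-in weights for one "expensive" block


def expensive_block(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a costly transformer computation (attention + MLP)."""
    return torch.tanh(x @ W)


def cached_denoising_loop(prompt, response, num_steps=8,
                          prompt_interval=4, refresh_ratio=0.25):
    """Toy denoising loop that reuses cached features between steps."""
    prompt_cache = None
    resp_inputs = response.clone()           # inputs seen when the cache was written
    resp_cache = expensive_block(response)   # cached block outputs for response tokens

    for step in range(num_steps):
        # Prompt features drift slowly, so recompute them only every few steps.
        if prompt_cache is None or step % prompt_interval == 0:
            prompt_cache = expensive_block(prompt)

        # Adaptive partial update: recompute only the response tokens whose
        # current inputs differ most from the inputs behind the cached outputs.
        sim = F.cosine_similarity(response, resp_inputs, dim=-1)
        k = max(1, int(refresh_ratio * response.shape[0]))
        stale = torch.topk(-sim, k).indices   # least-similar (most drifted) tokens
        resp_cache[stale] = expensive_block(response[stale])
        resp_inputs[stale] = response[stale]

        # A real model would feed (prompt_cache, resp_cache) to the remaining
        # layers and the denoising update; here we only mimic input drift.
        response = response + 0.01 * torch.randn_like(response)

    return prompt_cache, resp_cache


prompt = torch.randn(16, D)    # [prompt_len, hidden_dim]
response = torch.randn(32, D)  # [response_len, hidden_dim]
cached_denoising_loop(prompt, response)
```

The "adaptive" part of the method is deciding which features to refresh and how often; the exact policy and its ablations are described in the paper.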
To get started with dLLM-Cache, follow the installation instructions below.
- Clone the Repository:
  ```bash
  git clone https://github.com/maomaocun/dLLM-Cache.git
  cd dLLM-Cache
  ```
- Set Up the Environment:
  Create a Python environment with `conda` or `virtualenv`, then install the dependencies:
  ```bash
  bash install.sh
  ```
- Demo:
  ```bash
  python demo_{model_name}.py
  ```
- Running Experiments: Use the provided scripts:
  ```bash
  bash scripts/run_{model_name}_{task_name}_base.sh
  ```
  For example:
  - GSM8K with LLaDA:
    ```bash
    bash scripts/run_LLaDA_gsm8k_base.sh
    ```
  - BBH with Dream:
    ```bash
    bash scripts/run_Dream_bbh_base.sh
    ```
If you have any questions, please email yangyicun187@gmail.com.
This repository builds on LLaDA, Dream, LLaDA-V, MMaDA, and lm-evaluation-harness.
If you find dLLM-Cache useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{liu2025dllm,
  title={dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching},
  author={Liu, Zhiyuan and Yang, Yicun and Zhang, Yaojie and Chen, Junjie and Zou, Chang and Wei, Qingyuan and Wang, Shaobo and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2506.06295},
  year={2025}
}
```