Adrenaline is an attention decoupling and offloading mechanism designed to boost resource utilization and performance in LLM serving systems. Building on the prefill-decode (PD) disaggregation paradigm for LLM inference, Adrenaline disaggregates part of the attention computation in the decoding phase and offloads it to prefill instances. This enhances resource utilization and performance while ensuring compliance with user SLOs.
Adrenaline's advantages stem from the following aspects:
- Improved memory capacity and bandwidth utilization in prefill instances
- Increased decoding batch sizes that enhance compute utilization in decoding instances
- Dynamic offloading scheduling to preserve compliance with user SLOs
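As a rough illustration of the last point, the sketch below shows what a dynamic offloading decision could look like: offload attention for a slice of the decode batch in proportion to a prefill instance's spare memory bandwidth, and back off when the latency SLO is at risk. Everything here (names, the heuristic itself) is an illustrative assumption, not Adrenaline's actual scheduler interface:

```python
# Illustrative sketch of a dynamic offloading decision. The class and
# function names and the heuristic are assumptions for explanation,
# not Adrenaline's real scheduler.
from dataclasses import dataclass

@dataclass
class PrefillInstanceStats:
    spare_hbm_bandwidth_frac: float  # fraction of HBM bandwidth currently idle
    spare_kv_cache_blocks: int       # free KV-cache blocks on the instance

def plan_attention_offload(stats: PrefillInstanceStats,
                           decode_batch: int,
                           tpot_slo_ms: float,
                           est_tpot_ms: float) -> int:
    """Decide how many decode requests' attention to offload to a prefill
    instance: scale with its idle memory bandwidth, cap by its free
    KV-cache blocks, and back off entirely when the time-per-output-token
    (TPOT) SLO is already at risk."""
    if est_tpot_ms >= tpot_slo_ms:
        return 0  # no headroom: offloading could push decode past its SLO
    by_bandwidth = int(decode_batch * stats.spare_hbm_bandwidth_frac)
    return max(0, min(by_bandwidth, stats.spare_kv_cache_blocks))

# Example: a prefill instance with 40% idle bandwidth and 96 free blocks
# takes attention for 96 of 256 decode requests (bandwidth allows 102).
stats = PrefillInstanceStats(spare_hbm_bandwidth_frac=0.4, spare_kv_cache_blocks=96)
print(plan_attention_offload(stats, decode_batch=256, tpot_slo_ms=50.0, est_tpot_ms=32.0))
```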
Figure 1. Comparison of the original PD disaggregation and Adrenaline (decode batch size: M vs. M + N).
We recommend using uv, a very fast Python environment manager, to create and manage Python environments.
```bash
pip install uv ninja pytest
uv pip install -e .
uv pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.5/
python adrenaline/setup.py install
```
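An optional post-install sanity check (a minimal sketch; it only verifies that the CUDA stack and the FlashInfer wheel installed above are importable, nothing Adrenaline-specific):

```python
# Optional post-install sanity check: confirms PyTorch sees a CUDA device
# and that the flashinfer-python wheel installed above is importable.
import torch
import flashinfer

assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("flashinfer loaded from:", flashinfer.__file__)
```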
- Setup MPS

  ```bash
  bash adrenaline/scripts/start_mps.sh
  ```
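  To confirm MPS is actually up, a check along these lines can help (the daemon name `nvidia-cuda-mps-control` is the standard CUDA MPS control daemon; the snippet itself is an illustrative extra, not part of Adrenaline's scripts):

  ```python
  # Confirm the CUDA MPS control daemon is running after start_mps.sh.
  # nvidia-cuda-mps-control is the standard CUDA MPS daemon name; this
  # check is an illustrative extra, not part of Adrenaline's scripts.
  import subprocess

  proc = subprocess.run(["pgrep", "-f", "nvidia-cuda-mps-control"],
                        capture_output=True, text=True)
  print("MPS daemon running" if proc.stdout.strip() else "MPS daemon NOT found")
  ```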
- Prepare the minimal running environment (download dummy model weights and profile runtime data)

  ```bash
  VLLM_ATTENTION_BACKEND=ADRENALINE_FLASHINFER python -m examples.adrenaline.download_model
  bash adrenaline/scripts/profile_attention_bandwidth.sh
  ```
- Start Adrenaline servers (prefill/decode/attention instances)
  - To make it easier to check each instance's status and output, we use `tmux` to maintain sessions for the Adrenaline instances; please install `tmux` before starting the servers.
  - You can run `tmux attach -t adrenaline` to check the output of the instances.

  ```bash
  bash examples/adrenaline/start_demo_servers.sh
  ```
- Start clients and send requests to the Adrenaline servers

  ```bash
  bash examples/adrenaline/start_clients.sh
  ```
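  To send a request manually instead, something like the following should work, assuming the demo exposes vLLM's OpenAI-compatible API (the port and model name below are assumptions; check `start_demo_servers.sh` for the actual values):

  ```python
  # Hypothetical manual request against the demo servers. The endpoint
  # shape follows vLLM's OpenAI-compatible API; the port (8000) and the
  # model name are guesses; see start_demo_servers.sh for the real values.
  import json
  import urllib.request

  payload = {
      "model": "dummy-model",  # assumed name of the downloaded dummy model
      "prompt": "Explain PD disaggregation in one sentence.",
      "max_tokens": 64,
  }
  req = urllib.request.Request(
      "http://localhost:8000/v1/completions",  # assumed demo endpoint
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["choices"][0]["text"])
  ```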
- Stop Adrenaline servers and MPS

  ```bash
  bash examples/adrenaline/stop_demo_servers.sh
  bash adrenaline/scripts/stop_mps.sh
  ```
We list some of the primary evaluation results below; please refer to the evaluation guidance and our paper for more details. We evaluate Adrenaline on A100 80GB SXM GPUs with vLLM as the baseline. Note that, for simplicity, all frameworks turn off optimizations such as quantization, speculative decoding, and multi-step scheduling. By enhancing the resource utilization of the prefill and decode instances, Adrenaline achieves a 1.35~1.80x decode throughput boost compared to vLLM.
Figure 2. End-to-end evaluation of Llama-3.1 8B on the Mooncake dataset.
Figure 3. End-to-end evaluation of Llama-2 13B on the Mooncake dataset.
This project is built upon vLLM, and we extend our gratitude to its developers. As we transitioned from an internal development environment to this open-source platform, the detailed commit history and authorship information could not be preserved, so we take this opportunity to formally acknowledge all contributors.
If you find this project useful in your research, please consider citing our paper:
```bibtex
@misc{liang2025injectingadrenalinellmserving,
      title={Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation},
      author={Yunkai Liang and Zhangyu Chen and Pengfei Zuo and Zhi Zhou and Xu Chen and Zhou Yu},
      year={2025},
      eprint={2503.20552},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2503.20552},
}
```