Adrenaline is an attention decoupling and offloading mechanism designed to boost resource utilization and performance in LLM serving systems. Building on the prefill-decode (PD) disaggregation paradigm for LLM inference, Adrenaline disaggregates part of the attention computation in the decoding phase and offloads it to prefill instances. This enhances resource utilization and performance while ensuring compliance with user SLOs.
Adrenaline's advantages stem from the following aspects:
- Improved memory capacity and bandwidth utilization in prefill instances
- Increased decoding batch sizes that enhance compute utilization in decoding instances
- Dynamic offloading scheduling to preserve compliance with user SLOs
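As a rough illustration of the last point, the sketch below shows what a dynamic offloading decision could look like: offload attention for a slice of the decode batch in proportion to a prefill instance's spare memory bandwidth, and back off when the latency SLO is at risk. Everything here (names, the heuristic itself) is an illustrative assumption, not Adrenaline's actual scheduler interface:

```python
# Illustrative sketch of a dynamic offloading decision. The class and
# function names and the heuristic are assumptions for explanation,
# not Adrenaline's real scheduler.
from dataclasses import dataclass

@dataclass
class PrefillInstanceStats:
    spare_hbm_bandwidth_frac: float  # fraction of HBM bandwidth currently idle
    spare_kv_cache_blocks: int       # free KV-cache blocks on the instance

def plan_attention_offload(stats: PrefillInstanceStats,
                           decode_batch: int,
                           tpot_slo_ms: float,
                           est_tpot_ms: float) -> int:
    """Decide how many decode requests' attention to offload to a prefill
    instance: scale with its idle memory bandwidth, cap by its free
    KV-cache blocks, and back off entirely when the time-per-output-token
    (TPOT) SLO is already at risk."""
    if est_tpot_ms >= tpot_slo_ms:
        return 0  # no headroom: offloading could push decode past its SLO
    by_bandwidth = int(decode_batch * stats.spare_hbm_bandwidth_frac)
    return max(0, min(by_bandwidth, stats.spare_kv_cache_blocks))

# Example: a prefill instance with 40% idle bandwidth and 96 free blocks
# takes attention for 96 of 256 decode requests (bandwidth allows 102).
stats = PrefillInstanceStats(spare_hbm_bandwidth_frac=0.4, spare_kv_cache_blocks=96)
print(plan_attention_offload(stats, decode_batch=256, tpot_slo_ms=50.0, est_tpot_ms=32.0))
```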
Figure 1. Comparison of the original PD disaggregation and Adrenaline (decode batch size: M vs. M + N).
We recommend using uv, a very fast Python environment manager, to create and manage Python environments.
```bash
pip install uv ninja pytest
uv pip install -e .
uv pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.5/
python adrenaline/setup.py install
```
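An optional post-install sanity check (a minimal sketch; it only verifies that the CUDA stack and the FlashInfer wheel installed above are importable, nothing Adrenaline-specific):

```python
# Optional post-install sanity check: confirms PyTorch sees a CUDA device
# and that the flashinfer-python wheel installed above is importable.
import torch
import flashinfer

assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("flashinfer loaded from:", flashinfer.__file__)
```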
- Setup MPS

  ```bash
  bash adrenaline/scripts/start_mps.sh
  ```
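  To confirm MPS is actually up, a check along these lines can help (the daemon name `nvidia-cuda-mps-control` is the standard CUDA MPS control daemon; the snippet itself is an illustrative extra, not part of Adrenaline's scripts):

  ```python
  # Confirm the CUDA MPS control daemon is running after start_mps.sh.
  # nvidia-cuda-mps-control is the standard CUDA MPS daemon name; this
  # check is an illustrative extra, not part of Adrenaline's scripts.
  import subprocess

  proc = subprocess.run(["pgrep", "-f", "nvidia-cuda-mps-control"],
                        capture_output=True, text=True)
  print("MPS daemon running" if proc.stdout.strip() else "MPS daemon NOT found")
  ```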
- Prepare the minimal running environment (download dummy model weights and profile runtime data)

  ```bash
  VLLM_ATTENTION_BACKEND=ADRENALINE_FLASHINFER python -m examples.adrenaline.download_model
  bash adrenaline/scripts/profile_attention_bandwidth.sh
  ```
- Start Adrenaline servers (prefill/decode/attention instances)
  - To make it easier to check each instance's status and output, we use `tmux` to maintain sessions for the Adrenaline instances; please install `tmux` before starting the servers.
  - You can run `tmux attach -t adrenaline` to check the output of the instances.

  ```bash
  bash examples/adrenaline/start_demo_servers.sh
  ```
- Start clients and send requests to the Adrenaline servers

  ```bash
  bash examples/adrenaline/start_clients.sh
  ```
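  To send a request manually instead, something like the following should work, assuming the demo exposes vLLM's OpenAI-compatible API (the port and model name below are assumptions; check `start_demo_servers.sh` for the actual values):

  ```python
  # Hypothetical manual request against the demo servers. The endpoint
  # shape follows vLLM's OpenAI-compatible API; the port (8000) and the
  # model name are guesses; see start_demo_servers.sh for the real values.
  import json
  import urllib.request

  payload = {
      "model": "dummy-model",  # assumed name of the downloaded dummy model
      "prompt": "Explain PD disaggregation in one sentence.",
      "max_tokens": 64,
  }
  req = urllib.request.Request(
      "http://localhost:8000/v1/completions",  # assumed demo endpoint
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["choices"][0]["text"])
  ```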
- Stop Adrenaline servers and MPS

  ```bash
  bash examples/adrenaline/stop_demo_servers.sh
  bash adrenaline/scripts/stop_mps.sh
  ```
We list some of the primary evaluation results below; please refer to the evaluation guidance and our paper for more details. We evaluate Adrenaline on A100 80GB SXM GPUs with vLLM as the baseline. Note that, for simplicity, all frameworks turn off optimizations such as quantization, speculative decoding, and multi-step scheduling. By enhancing the resource utilization of the prefill and decode instances, Adrenaline achieves a 1.35~1.80x decode throughput boost compared to vLLM.
Figure 2. End-to-end evaluation of Llama-3.1 8B on the Mooncake dataset.
Figure 3. End-to-end evaluation of Llama-2 13B on the Mooncake dataset.
This project is built upon vLLM, and we extend our gratitude to its developers. As we transitioned from an internal development environment to this open-source platform, the detailed commit history and authorship information could not be preserved, so we take this opportunity to formally acknowledge all contributors.
If you find this project useful in your research, please consider citing our paper:
```bibtex
@misc{liang2025injectingadrenalinellmserving,
      title={Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation},
      author={Yunkai Liang and Zhangyu Chen and Pengfei Zuo and Zhi Zhou and Xu Chen and Zhou Yu},
      year={2025},
      eprint={2503.20552},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2503.20552},
}
```