
Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Tsinghua University, MIT, UC Berkeley

(Figure: teaser)

Twilight is a composable optimizer that accelerates existing top-$k$ sparse decoding methods through hierarchical top-$p$ pruning, making them efficient and budget-adaptive.

Key Design: Optimizing Existing Algorithms via Hierarchical Top-p Pruning

Traditional top-$k$-based sparse attention methods can be unified under a Select-then-SpAttn architecture, where:

  • Selector: usually a fast $q \cdot k$ approximation followed by a topk operator that picks the indices of the most relevant tokens.
  • Sparse Attention: a.k.a. Paged Attention, which takes the selected indices as input and computes attention only over those tokens.

However, these methods usually use a fixed budget $k$ for how many tokens to attend to. Twilight hooks into this unified architecture by adding a Pruner component right after the Selector, yielding what our paper calls the Select-then-Prune architecture.

By first selecting tokens with a conservative budget using the base algorithm's Selector and then pruning them with a top-$p$ pruner, Twilight equips these methods with adaptive budget decisions without sacrificing accuracy, as sketched below.
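
The sketch below illustrates the Select-then-Prune pipeline for a single decoding query and a single head in plain PyTorch. It is a minimal illustration of the idea, not the repository's implementation: the Selector here uses exact $q \cdot k$ scores instead of a fast approximation, and the budget and threshold values are placeholder assumptions.

```python
# Minimal single-head sketch of Select-then-Prune (illustrative only).
import torch

def select_then_prune_attention(q, K, V, k_budget=256, top_p=0.99):
    """q: [d]; K, V: [n, d] cached keys/values for one head."""
    d = q.shape[0]
    scores = (K @ q) / d ** 0.5          # exact q·k here; a real Selector uses a fast approximation

    # Selector: conservative top-k over the (approximate) scores.
    k_budget = min(k_budget, K.shape[0])
    top_scores, top_idx = torch.topk(scores, k_budget)   # returned in descending order

    # Pruner: top-p -- keep the smallest prefix of tokens whose softmax mass
    # reaches top_p, so the effective budget adapts to each query.
    probs = torch.softmax(top_scores, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    keep = (cum - probs) < top_p          # include the token that crosses the threshold
    pruned_idx = top_idx[keep]

    # Sparse attention over the surviving tokens only.
    weights = torch.softmax((K[pruned_idx] @ q) / d ** 0.5, dim=-1)
    return weights @ V[pruned_idx], pruned_idx

# Example: the surviving index set is typically much smaller than k_budget.
q, K, V = torch.randn(128), torch.randn(4096, 128), torch.randn(4096, 128)
out, kept = select_then_prune_attention(q, K, V)
```

Because top_p bounds the attention mass rather than the token count, the number of surviving tokens shrinks when the attention distribution is peaked and grows when it is flat, which is what makes the budget adaptive.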

(Figure: Select-then-Prune architecture overview)

Installation

conda create -n twi python=3.10
conda activate twi
pip install -r requirements.txt
pip install -e .

Note: installing flash-attn may take several minutes.

Evaluation

Twilight accelerates state-of-the-art methods such as Quest and Double Sparse with nearly zero accuracy loss.

| Method | LongBench (w/o Twilight) | LongBench (w/ Twilight) | Avg. Budget After Pruning |
| --- | --- | --- | --- |
| Full (32k) | 36.78 | 38.52 (+4.7%) | 146 |
| Quest (8192 budget) | 37.10 | 38.04 (+2.5%) | 131 |
| DS (8192 budget) | 36.62 | 38.71 (+5.7%) | 126 |

(Figure: accuracy evaluation results)

* Results on Longchat-7B-v1.5-32k

Accuracy Evaluation

We implement a Python version of Twilight and several existing top-$k$ methods for accuracy-only evaluation. To benchmark different methods, we use a unified configuration format.

We recommend running the following commands under the benchmark/ directory; the results will be dumped to result_<benchmark_name>/<model_name>/xxx.

Passkey

# Modify MODEL, MODEL_PATH and algo_config_path in scripts/run_passkey.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/run_passkey.sh

Longbench

# Modify MODEL and MODEL_PATH in scripts/run_longbench.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/run_longbench.sh configs/config_quest_1024.json
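
The unified configuration files live under configs/; their schema is defined by the files shipped in the repository rather than by this snippet. A quick way to inspect the configuration referenced above:

```python
# Print the benchmark configuration used in the LongBench command above.
import json

with open("configs/config_quest_1024.json") as f:
    print(json.dumps(json.load(f), indent=2))
```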

Efficiency Evaluation

We provide implementations of Flash-TopK-Attention for existing top-$k$ algorithms using FlashInfer (CUDA), Triton, and TileLang. A plain-PyTorch reference of what these kernels compute is sketched below.
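
For reference, the computation these fused kernels perform is attention restricted to the selected indices. The sketch below shows the semantics only, not the kernels' actual interfaces; the function name and argument layout are assumptions for illustration.

```python
# Reference semantics of a top-k (index-gathered) attention decode step.
# The FlashInfer/Triton/TileLang kernels compute the same result without
# materializing the gathered keys and values.
import torch

def topk_attention_reference(q, K, V, indices):
    """q: [heads, d]; K, V: [n, heads, d]; indices: [heads, k] selected token ids."""
    heads, d = q.shape
    head_ids = torch.arange(heads)[:, None]              # [heads, 1], broadcasts against indices
    Ks, Vs = K[indices, head_ids], V[indices, head_ids]  # each [heads, k, d]
    logits = torch.einsum("hd,hkd->hk", q, Ks) / d ** 0.5
    weights = torch.softmax(logits, dim=-1)
    return torch.einsum("hk,hkd->hd", weights, Vs)       # [heads, d]
```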

Citation

If you find Twilight useful or relevant to your project and research, please kindly cite our paper:

@article{lin2025twilight,
  title={Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning},
  author={Lin, Chaofan and Tang, Jiaming and Yang, Shuo and Wang, Hanshuo and Tang, Tian and Tian, Boyu and Stoica, Ion and Han, Song and Gao, Mingyu},
  journal={arXiv preprint arXiv:2502.02770},
  year={2025}
}

Acknowledgement

We learned from the designs and optimizations of, and reused code from, the following projects: FlashInfer, Quest, Atom, FasterTransformer, and QServe. We also thank research projects such as DuoAttention, PyramidKV, Ada-KV, and MagicPIG for introducing dynamic budgets at different levels and pushing past the limitations of top-$k$ attention.

About

[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
