
Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Tsinghua University, MIT, UC Berkeley

(Figure: teaser)

Twilight is a composable optimizer that accelerates existing top-$k$ sparse decoding methods through hierarchical top-$p$ pruning, making them efficient and budget-adaptive.

Key Design: Optimizing Existing Algorithms via Hierarchical Top-p Pruning

Traditional top-$k$-based sparse attention methods can be unified under a Select-then-SpAttn architecture, where:

  • Selector: usually a fast $q \cdot k$ approximation followed by a topk operator that picks the indices of the most relevant tokens.
  • Sparse Attention: a.k.a. Paged Attention, which takes the selected indices as input and computes attention only over those tokens.

However, these methods usually use a fixed budget $k$ for how many tokens to attend to. Twilight hooks into this unified architecture by adding a Pruner component right after the Selector, yielding what our paper calls the Select-then-Prune architecture.

By first selecting tokens with a conservative budget using the base algorithm's Selector and then pruning them with a top-$p$ pruner, Twilight equips these methods with adaptive budget decisions without sacrificing accuracy, as sketched below.
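
The sketch below illustrates the Select-then-Prune pipeline for a single decoding query and a single head in plain PyTorch. It is a minimal illustration of the idea, not the repository's implementation: the Selector here uses exact $q \cdot k$ scores instead of a fast approximation, and the budget and threshold values are placeholder assumptions.

```python
# Minimal single-head sketch of Select-then-Prune (illustrative only).
import torch

def select_then_prune_attention(q, K, V, k_budget=256, top_p=0.99):
    """q: [d]; K, V: [n, d] cached keys/values for one head."""
    d = q.shape[0]
    scores = (K @ q) / d ** 0.5          # exact q·k here; a real Selector uses a fast approximation

    # Selector: conservative top-k over the (approximate) scores.
    k_budget = min(k_budget, K.shape[0])
    top_scores, top_idx = torch.topk(scores, k_budget)   # returned in descending order

    # Pruner: top-p -- keep the smallest prefix of tokens whose softmax mass
    # reaches top_p, so the effective budget adapts to each query.
    probs = torch.softmax(top_scores, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    keep = (cum - probs) < top_p          # include the token that crosses the threshold
    pruned_idx = top_idx[keep]

    # Sparse attention over the surviving tokens only.
    weights = torch.softmax((K[pruned_idx] @ q) / d ** 0.5, dim=-1)
    return weights @ V[pruned_idx], pruned_idx

# Example: the surviving index set is typically much smaller than k_budget.
q, K, V = torch.randn(128), torch.randn(4096, 128), torch.randn(4096, 128)
out, kept = select_then_prune_attention(q, K, V)
```

Because top_p bounds the attention mass rather than the token count, the number of surviving tokens shrinks when the attention distribution is peaked and grows when it is flat, which is what makes the budget adaptive.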

(Figure: Select-then-Prune architecture overview)

Installation

conda create -n twi python=3.10
conda activate twi
pip install -r requirements.txt
pip install -e .

Note: installing flash-attn may take several minutes.

Evaluation

Twilight accelerates state-of-the-art methods such as Quest and Double Sparse with nearly zero accuracy loss.

| Method | LongBench (w/o Twilight) | LongBench (w/ Twilight) | Avg. Budget After Pruning |
| --- | --- | --- | --- |
| Full (32k) | 36.78 | 38.52 (+4.7%) | 146 |
| Quest (8192 budget) | 37.10 | 38.04 (+2.5%) | 131 |
| DS (8192 budget) | 36.62 | 38.71 (+5.7%) | 126 |

(Figure: accuracy evaluation results)

* Results on Longchat-7B-v1.5-32k

Accuracy Evaluation

We implement a Python version of Twilight and several existing top-$k$ methods for accuracy-only evaluation. To benchmark different methods, we use a unified configuration format.

We recommend running the following commands under the benchmark/ directory; the results will be dumped to result_<benchmark_name>/<model_name>/xxx.

Passkey

# Modify MODEL, MODEL_PATH and algo_config_path in scripts/run_passkey.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/run_passkey.sh

Longbench

# Modify MODEL and MODEL_PATH in scripts/run_longbench.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/run_longbench.sh configs/config_quest_1024.json
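
The unified configuration files live under configs/; their schema is defined by the files shipped in the repository rather than by this snippet. A quick way to inspect the configuration referenced above:

```python
# Print the benchmark configuration used in the LongBench command above.
import json

with open("configs/config_quest_1024.json") as f:
    print(json.dumps(json.load(f), indent=2))
```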

Efficiency Evaluation

We provide implementations of Flash-TopK-Attention for existing top-$k$ algorithms using FlashInfer (CUDA), Triton, and TileLang. A plain-PyTorch reference of what these kernels compute is sketched below.
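
For reference, the computation these fused kernels perform is attention restricted to the selected indices. The sketch below shows the semantics only, not the kernels' actual interfaces; the function name and argument layout are assumptions for illustration.

```python
# Reference semantics of a top-k (index-gathered) attention decode step.
# The FlashInfer/Triton/TileLang kernels compute the same result without
# materializing the gathered keys and values.
import torch

def topk_attention_reference(q, K, V, indices):
    """q: [heads, d]; K, V: [n, heads, d]; indices: [heads, k] selected token ids."""
    heads, d = q.shape
    head_ids = torch.arange(heads)[:, None]              # [heads, 1], broadcasts against indices
    Ks, Vs = K[indices, head_ids], V[indices, head_ids]  # each [heads, k, d]
    logits = torch.einsum("hd,hkd->hk", q, Ks) / d ** 0.5
    weights = torch.softmax(logits, dim=-1)
    return torch.einsum("hk,hkd->hd", weights, Vs)       # [heads, d]
```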

Citation

If you find Twilight useful or relevant to your project and research, please kindly cite our paper:

@article{lin2025twilight,
  title={Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning},
  author={Lin, Chaofan and Tang, Jiaming and Yang, Shuo and Wang, Hanshuo and Tang, Tian and Tian, Boyu and Stoica, Ion and Han, Song and Gao, Mingyu},
  journal={arXiv preprint arXiv:2502.02770},
  year={2025}
}

Acknowledgement

We learned from the designs and optimizations of, and reused code from, the following projects: FlashInfer, Quest, Atom, FasterTransformer, and QServe. We also thank research projects such as DuoAttention, PyramidKV, Ada-KV, and MagicPIG for introducing dynamic budgets at different levels and pushing past the limitations of top-$k$ attention.

About

[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
