Ion Stoica, Song Han, Mingyu Gao
Twilight is a composable optimizer that accelerates any existing top-$k$ sparse decoding method through hierarchical top-$p$ pruning, making it efficient and budget-adaptive.
Traditional top-$k$ based sparse attention can be unified into a Select-then-SpAttn architecture, where:

- Selector: usually consists of a fast $q \cdot k$ approximation and a top-$k$ operator to filter out the indices.
- Sparse Attention: a.k.a. Paged Attention, which takes the selected indices as input and computes attention only over these tokens.

However, these methods usually use a fixed budget. By first selecting tokens with a conservative budget using the base algorithm's Selector and then pruning them with a top-$p$ pruner, Twilight equips them with adaptive budget decisions without sacrificing accuracy.
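Below is a minimal, self-contained PyTorch sketch of this Select-then-SpAttn flow with a top-$p$ pruner inserted in between. The function names and the simple dot-product selector are illustrative assumptions, not the repository's actual API.

```python
import torch

def select_topk(q, k, budget):
    # Selector: fast approximate q·k scores followed by a top-k filter.
    # (Real selectors such as Quest score per-page metadata instead of every key.)
    scores = q @ k.T                                   # [seq_len]
    return torch.topk(scores, k=min(budget, scores.numel())).indices

def prune_topp(q, k, indices, p=0.95):
    # Twilight-style pruner: keep only the highest-weight selected tokens whose
    # cumulative softmax mass reaches p, so the effective budget adapts per query.
    scores = (q @ k[indices].T) / k.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    order = torch.argsort(probs, descending=True)
    cum = torch.cumsum(probs[order], dim=-1)
    keep = int((cum < p).sum().item()) + 1             # smallest prefix covering mass p
    return indices[order[:keep]]

def sparse_attention(q, k, v, indices):
    # SpAttn: exact attention restricted to the surviving token indices.
    scores = (q @ k[indices].T) / k.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v[indices]

# Toy usage: one head, one decoding query against a 1024-token KV cache.
torch.manual_seed(0)
q = torch.randn(64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
idx = select_topk(q, k, budget=256)    # conservative top-k selection
idx = prune_topp(q, k, idx, p=0.95)    # adaptive top-p pruning
out = sparse_attention(q, k, v, idx)
print(out.shape, idx.numel())          # torch.Size([64]) and a pruned budget <= 256
```

In the actual system these stages run as optimized GPU kernels; the sketch only shows where the top-$p$ pruning step slots into the pipeline.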
```bash
conda create -n twi python=3.10
conda activate twi
pip install -r requirements.txt
pip install -e .
```

Note: installing flash-attn may take several minutes.
Twilight accelerates SOTA methods like Quest and Double Sparse with nearly zero accuracy loss.
| Methods | LongBench (w/o Twilight) | LongBench (w/ Twilight) | Avg. Budget After Pruning |
|---|---|---|---|
| Full (32k) | 36.78 | 38.52 (+4.7%) | 146 |
| Quest (8192 budget) | 37.10 | 38.04 (+2.5%) | 131 |
| DS (8192 budget) | 36.62 | 38.71 (+5.7%) | 126 |
* Results on Longchat-7B-v1.5-32k
We implement a Python version of Twilight and several other existing top-$k$ methods for accuracy-only evaluation. To benchmark different methods, we use a unified configuration format.
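For illustration, a per-algorithm config could look like the sketch below; the field names here are hypothetical assumptions, so consult the JSON files shipped under configs/ (e.g. configs/config_quest_1024.json used later) for the actual schema.

```python
import json

# Hypothetical fields for illustration only; the real schema is whatever the
# JSON files under configs/ actually contain.
example_config = {
    "algo": "quest",   # base top-k method whose Selector is reused
    "budget": 1024,    # conservative top-k budget before pruning
    "top_p": 0.95,     # Twilight's top-p pruning threshold
}
print(json.dumps(example_config, indent=2))
```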
We recommend running the following commands under the benchmark/ directory; the results will be dumped to result_<benchmark_name>/<model_name>/xxx.
```bash
# Modify MODEL, MODEL_PATH and algo_config_path in scripts/run_passkey.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/run_passkey.sh
```

```bash
# Modify MODEL and MODEL_PATH in scripts/run_longbench.sh
CUDA_VISIBLE_DEVICES=0 bash scripts/run_longbench.sh configs/config_quest_1024.json
```

We have organized implementations of Flash-TopK-Attention for the existing top-$k$ algorithms using FlashInfer (CUDA), Triton, and TileLang.
If you find Twilight useful or relevant to your project and research, please kindly cite our paper:
```bibtex
@article{lin2025twilight,
  title={Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning},
  author={Lin, Chaofan and Tang, Jiaming and Yang, Shuo and Wang, Hanshuo and Tang, Tian and Tian, Boyu and Stoica, Ion and Han, Song and Gao, Mingyu},
  journal={arXiv preprint arXiv:2502.02770},
  year={2025}
}
```

We learned the designs/optimizations and reused code from the following projects: FlashInfer, Quest, Atom, FasterTransformer, QServe. We also thank research projects like DuoAttention, PyramidKV, Ada-KV, and MagicPIG for bringing the idea of dynamic budgets across different levels and breaking the limitations of top-$k$ attention.



