Ranajoy Sadhukhan*, Zhuoming Chen*, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Carnegie Mellon University
We introduce Kinetics, which challenges the traditional test-time scaling (TTS) laws by adopting a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook major key-value memory access bottlenecks in various TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget.
![]() | ![]() |
In this codebase, we provide,
- eFLOPs-based cost model analysis
- Comparison of previous and Kinetics scaling law
- Comparison of dense and sparse scaling law based on eFLOPs based cost model
- Block Top-K attention speed-up demonstration
We provide AIME24 and AIME25 reasoning traces for Qwen3 model series and its sparse attention variants through Huggingface.
The repository is organized as follows,
Kinetics/
├── LiteSys/
| ├──examples/
| | └── eval_math.sh
| └──litesys/
| | ├── attention/
| | ├── engine/
| | └── models/
├── benchmark/
| ├──blocktopk.py
| └──dense.py
├── cost_model
| ├── best_of_N
| | ├── cost_model_ntrial.py
| | ├── frontier_numTrials.ipynb
| | └── compare_sparse_numTrials.ipynb
| ├── long_CoT
| | ├── cost_model_genlen.py
| | ├── frontier_genlen.ipynb
| | └── compare_sparse_genlen.ipynb
| ├── utils.py
├── README.md
└── requirements.txt
cost_model/
: contains code to perform eFLOPs-based cost analysis for two different inference-scaling strategies - best of N and long CoT.benchmark/
: contains code for benchmarking the task throughput with block top-k and dense attention-based model.litesys/
: contains code for performing reasoning benchmarking (contains continuous batching and sparse attention implementations (yet to be optimized)).
conda create -n kinetics python=3.11
conda activate kinetics
# install flashinfer v0.1.6
wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu124torch2.4-cp311-cp311-linux_x86_64.whl#sha256=19a01e2ec93662bc6b83819daaae277d93e7cc989343c5f8940af44a4cb66ba0
pip install -r requirements.txt
Download the reasoning traces from Huggingface
The samples are saved as jsonl files. Every example dict contains the following keys:
query
: question textchoice
: ground-truth output listprediction
: output prediction text listscore
: 0.0 or 100.0 Each example is replicated 32 times (32 max trials). For 32B, we only provide 8 trials.
Compute inference cost and accuracies for different test-time configurations (number of trials and generation lengths).
cd cost_model
python3 best_of_N/cost_model_ntrial.py <path_to_response_root> <method>
python3 best_of_N/cost_model_genlen.py <path_to_response_root> <method>
Args:
<response_root>
: Dataset directory (e.g.,AIME24
,AIME25
)<method>
: One ofdense
,topk
,blocktopk
First generate cost analysis csv files using the scripts mentioned above. Then refer to cost_model/best_of_N/frontier_numTrials.ipynb
and cost_model/long_CoT/frontier_genlen.ipynb
to obtain the accuracy vs cost budget pareto curves for dense and the sparse variants using best of N and long CoT scaling stratgies respectively.
![]() |
![]() |
We provide a benchmarking code for reasoning tasks including AIME24, AIME25 and LiveCodeBench in the form of LiteSys. Currently it supports continuous batching and sparse attention kernels.
Note, the sparsity kernels are not optimized yet. We are currently working on integrating efficient implementations into SGLang.
Contains paged-attention implementation of dense attention and block top-k attention. We only provide implementation for 1 decoder layer, considering n-way Tensor Parallelism (n is the number of key-value heads).
python3 dense.py --model Qwen/Qwen3-8B --gen_len 32768 --batch_size 4096 --page_size 16 --world_size 8
python3 blocktopk.py --model Qwen/Qwen3-8B --gen_len 32768 --batch_size 4096 --page_size 16 --topk_page 64 --world_size 8
- Add SGLang based evaluation
@misc{sadhukhan2025kineticsrethinkingtesttimescaling,
title={Kinetics: Rethinking Test-Time Scaling Laws},
author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
year={2025},
eprint={2506.05333},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.05333},
}