Description
Benchmarks that crown the “best” reasoning LLMs routinely publish a single headline score, yet ignore how many tokens (and therefore how much time, energy, and money) the model spent to achieve that score. When practitioners must meet strict service‑level objectives (SLOs) for latency and cost, a system that needs five times more tokens per answer is rarely competitive, even if its accuracy is marginally higher. Recent work on agent evaluation has highlighted similar blind spots by arguing for cost‑controlled leaderboards and Pareto visualizations of accuracy versus dollars. Building on this insight, we must shift the focus from dollars to the more fundamental currency of inference: tokens.
Token Efficiency Benchmarking (TEB) will be a systematic framework that measures the quality–token trade‑off of reasoning LLMs and test‑time compute optimization techniques. For each model or technique, TEB will record:
(i) task‑level quality metrics, e.g., exact‑string accuracy on GSM‑8K or pass@k on HumanEval; and
(ii) the total number of prompt, generation, and tool‑use tokens consumed to reach the reported score (see the sketch after this list).
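As a rough sketch only (not guidellm's existing API; `TEBRecord`, `pareto_front`, and all field names below are hypothetical), a single TEB measurement could pair a task‑level quality metric with the token counts that produced it, and a small helper could expose the quality‑versus‑tokens Pareto front that a leaderboard would report instead of a single headline score:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TEBRecord:
    """One measurement of a model/technique on one task (hypothetical schema)."""
    model: str               # model or test-time-compute technique under test
    task: str                # e.g. "gsm8k" or "humaneval"
    quality: float           # task-level metric (exact-string accuracy, pass@k, ...)
    prompt_tokens: int       # tokens sent to the model across the run
    generation_tokens: int   # tokens generated by the model
    tool_tokens: int = 0     # tokens consumed by tool calls, if any

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.generation_tokens + self.tool_tokens


def pareto_front(records: list[TEBRecord]) -> list[TEBRecord]:
    """Keep records not dominated on (higher quality, fewer total tokens)."""
    def dominates(a: TEBRecord, b: TEBRecord) -> bool:
        return (
            a.quality >= b.quality
            and a.total_tokens <= b.total_tokens
            and (a.quality > b.quality or a.total_tokens < b.total_tokens)
        )

    return sorted(
        (r for r in records if not any(dominates(o, r) for o in records)),
        key=lambda r: r.total_tokens,
    )


# Usage: both runs survive because neither dominates the other --
# one is more accurate, the other reaches its score with far fewer tokens.
runs = [
    TEBRecord("model-a", "gsm8k", quality=0.92,
              prompt_tokens=1_200_000, generation_tokens=5_400_000),
    TEBRecord("model-b", "gsm8k", quality=0.90,
              prompt_tokens=1_200_000, generation_tokens=1_100_000),
]
print(pareto_front(runs))
```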
The ask is to implement TEB in guidellm.