Description
Benchmarks that crown the “best” reasoning LLMs routinely publish a single headline score, yet ignore how many tokens (and therefore how much time, energy, and money) the model spent to achieve that score. When practitioners must meet strict service‑level objectives (SLOs) for latency and cost, a system that needs five times more tokens per answer is rarely competitive, even if its accuracy is marginally higher. Recent work on agent evaluation has highlighted similar blind spots by arguing for cost‑controlled leaderboards and Pareto visualizations of accuracy versus dollars. Building on this insight, we must shift the focus from dollars to the more fundamental currency of inference: tokens.
Token Efficiency Benchmarking (TEB) will be a systematic framework that measures the quality–token trade‑off of reasoning LLMs and test‑time compute optimization techniques. For each model or technique, TEB will record:
(i) task‑level quality metrics, e.g., exact‑string accuracy on GSM‑8K or pass@k on HumanEval; and
(ii) the total number of prompt, generation, and tool‑use tokens consumed to reach the reported score (see the sketch after this list).
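As a rough sketch only (not guidellm's existing API; `TEBRecord`, `pareto_front`, and all field names below are hypothetical), a single TEB measurement could pair a task‑level quality metric with the token counts that produced it, and a small helper could expose the quality‑versus‑tokens Pareto front that a leaderboard would report instead of a single headline score:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TEBRecord:
    """One measurement of a model/technique on one task (hypothetical schema)."""
    model: str               # model or test-time-compute technique under test
    task: str                # e.g. "gsm8k" or "humaneval"
    quality: float           # task-level metric (exact-string accuracy, pass@k, ...)
    prompt_tokens: int       # tokens sent to the model across the run
    generation_tokens: int   # tokens generated by the model
    tool_tokens: int = 0     # tokens consumed by tool calls, if any

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.generation_tokens + self.tool_tokens


def pareto_front(records: list[TEBRecord]) -> list[TEBRecord]:
    """Keep records not dominated on (higher quality, fewer total tokens)."""
    def dominates(a: TEBRecord, b: TEBRecord) -> bool:
        return (
            a.quality >= b.quality
            and a.total_tokens <= b.total_tokens
            and (a.quality > b.quality or a.total_tokens < b.total_tokens)
        )

    return sorted(
        (r for r in records if not any(dominates(o, r) for o in records)),
        key=lambda r: r.total_tokens,
    )


# Usage: both runs survive because neither dominates the other --
# one is more accurate, the other reaches its score with far fewer tokens.
runs = [
    TEBRecord("model-a", "gsm8k", quality=0.92,
              prompt_tokens=1_200_000, generation_tokens=5_400_000),
    TEBRecord("model-b", "gsm8k", quality=0.90,
              prompt_tokens=1_200_000, generation_tokens=1_100_000),
]
print(pareto_front(runs))
```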
The ask is to implement TEB in guidellm.