
Token Efficiency Benchmarking #179

@ashishkamra

Description

Benchmarks that crown the “best” reasoning LLMs routinely publish a single headline score, yet ignore how many tokens—and therefore how much time, energy, and money—the model spent to achieve that score. When practitioners must meet strict service‑level objectives (SLOs) for latency and cost, a system that needs five times more tokens per answer is rarely competitive, even if its accuracy is marginally higher. Recent work on agent evaluation has highlighted similar blind spots by arguing for cost‑controlled leaderboards and Pareto visualizations of accuracy versus dollars. Building on this insight, we should shift the focus from dollars to the more fundamental currency of inference: tokens.

Token Efficiency Benchmarking (TEB) will be a systematic framework that measures the quality–token trade‑off of reasoning LLMs and test‑time compute optimization techniques. For each model or technique, TEB will record
(i) task‑level quality metrics—e.g., exact‑string accuracy on GSM‑8K or pass@k on HumanEval—and
(ii) the total number of prompt, generation, and tool‑use tokens consumed to reach the reported score.
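To make the data model concrete, here is a minimal sketch of what a TEB result record and two summary views (quality per token, and the accuracy-vs-token Pareto frontier) could look like. All names (`TEBRecord`, `total_tokens`, `pareto_frontier`, etc.) are illustrative assumptions for this proposal, not existing guidellm API:

```python
# Hypothetical sketch for the TEB proposal; names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class TEBRecord:
    model: str                 # model or test-time-compute technique under test
    task: str                  # e.g. "gsm8k" or "humaneval"
    quality: float             # task-level metric, e.g. exact-match accuracy or pass@k
    prompt_tokens: int         # tokens sent as input across the whole run
    generation_tokens: int     # tokens generated by the model
    tool_tokens: int = 0       # tokens exchanged with tools, if any

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.generation_tokens + self.tool_tokens

    @property
    def quality_per_kilotoken(self) -> float:
        # Simple efficiency ratio: quality points per 1,000 tokens consumed.
        return 1000.0 * self.quality / self.total_tokens


def pareto_frontier(records: list[TEBRecord]) -> list[TEBRecord]:
    """Return the non-dominated records: no other record achieves
    higher quality with fewer or equal total tokens."""
    frontier: list[TEBRecord] = []
    best_quality = float("-inf")
    # Sort by token budget ascending; break ties by preferring higher quality.
    for rec in sorted(records, key=lambda r: (r.total_tokens, -r.quality)):
        if rec.quality > best_quality:
            frontier.append(rec)
            best_quality = rec.quality
    return frontier
```

If guidellm already captures per-request prompt and generation token counts, records like these could presumably be aggregated from its existing benchmark output rather than requiring new instrumentation.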

The ask is to implement TEB in guidellm.

cc @rgreenberg1 @sjmonson @markurtz @dagrayvid
