Add agent-tool-abstention eval (13 samples, Match template) by MukundaKatta · Pull Request #1656 · openai/evals

MukundaKatta · 2026-05-08T06:16:40Z

Eval description

Tests an LLM's ability to abstain from calling a tool when no tool fits the user's request, while still picking the right tool when one does fit. Each sample presents 3-4 thematically related tools alongside a user request; the model must output either the canonical tool name or the literal string NO_TOOL.

This targets a well-documented failure mode in tool-using agents: confabulating a tool call (e.g. answering summarize_text for "tell me a joke about programmers") rather than declining when nothing fits. It's a companion to agent-tool-routing (PR #1655) - that eval measures correct selection assuming the answer is in the list; this one measures correct refusal when it isn't.

Composition

13 samples total:

9 have ideal: NO_TOOL (out-of-scope user requests)
4 have a real tool answer (positive controls so a model that always answers NO_TOOL does not score 100%)

The four positive-control samples are intentionally similar to ones in agent-tool-routing so a passing model has to disambiguate based on the tool list, not on the request alone.

What changed

New samples file: evals/registry/data/agent_tool_abstention/samples.jsonl
New registry entry: evals/registry/evals/agent-tool-abstention.yaml

Uses the existing evals.elsuite.basic.match:Match template, no custom code.

Why eval this

Real-world agent platforms (chat assistants, IDE copilots, MCP clients) all have to handle requests outside their tool surface, and a confabulated tool call is a worse failure than a polite "I cannot do that" response.
Match-based evaluation is cheap and reproducible; the eval is small enough to run on every model bump.
Hand-curated; extendable by adding more .jsonl rows without code changes.

Run

oaieval gpt-4o-mini agent-tool-abstention

Checklist

Eval ID has .dev.v0 suffix
Description and disclaimer present in YAML
Uses an existing eval template (no custom code)
Samples are deterministic, in JSONL with input (chat messages) and ideal
All ideal answers are uniquely identifiable from the prompt

Add agent-tool-abstention eval (13 samples, Match template)

e69385e

MukundaKatta requested review from andrew-openai, etr2460 and katyhshi as code owners May 8, 2026 06:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add agent-tool-abstention eval (13 samples, Match template)#1656

Add agent-tool-abstention eval (13 samples, Match template)#1656
MukundaKatta wants to merge 1 commit into
openai:mainfrom
MukundaKatta:add-agent-tool-abstention-eval

MukundaKatta commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MukundaKatta commented May 8, 2026

Eval description

Composition

What changed

Why eval this

Run

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant