Skip to content

Add agent-tool-abstention eval (13 samples, Match template)#1656

Open
MukundaKatta wants to merge 1 commit into
openai:mainfrom
MukundaKatta:add-agent-tool-abstention-eval
Open

Add agent-tool-abstention eval (13 samples, Match template)#1656
MukundaKatta wants to merge 1 commit into
openai:mainfrom
MukundaKatta:add-agent-tool-abstention-eval

Conversation

@MukundaKatta
Copy link
Copy Markdown

Eval description

Tests an LLM's ability to abstain from calling a tool when no tool fits the user's request, while still picking the right tool when one does fit. Each sample presents 3-4 thematically related tools alongside a user request; the model must output either the canonical tool name or the literal string NO_TOOL.

This targets a well-documented failure mode in tool-using agents: confabulating a tool call (e.g. answering summarize_text for "tell me a joke about programmers") rather than declining when nothing fits. It's a companion to agent-tool-routing (PR #1655) - that eval measures correct selection assuming the answer is in the list; this one measures correct refusal when it isn't.

Composition

13 samples total:

  • 9 have ideal: NO_TOOL (out-of-scope user requests)
  • 4 have a real tool answer (positive controls so a model that always answers NO_TOOL does not score 100%)

The four positive-control samples are intentionally similar to ones in agent-tool-routing so a passing model has to disambiguate based on the tool list, not on the request alone.

What changed

  • New samples file: evals/registry/data/agent_tool_abstention/samples.jsonl
  • New registry entry: evals/registry/evals/agent-tool-abstention.yaml

Uses the existing evals.elsuite.basic.match:Match template, no custom code.

Why eval this

  • Real-world agent platforms (chat assistants, IDE copilots, MCP clients) all have to handle requests outside their tool surface, and a confabulated tool call is a worse failure than a polite "I cannot do that" response.
  • Match-based evaluation is cheap and reproducible; the eval is small enough to run on every model bump.
  • Hand-curated; extendable by adding more .jsonl rows without code changes.

Run

oaieval gpt-4o-mini agent-tool-abstention

Checklist

  • Eval ID has .dev.v0 suffix
  • Description and disclaimer present in YAML
  • Uses an existing eval template (no custom code)
  • Samples are deterministic, in JSONL with input (chat messages) and ideal
  • All ideal answers are uniquely identifiable from the prompt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant