fix: extract \boxed{} from model response in hendrycks_math #3644

Open
NezLheimeur wants to merge 3 commits into EleutherAI:main from NezLheimeur:fix/hendrycks-math-boxed-extraction

NezLheimeur commented Mar 20, 2026

Problem

process_results in hendrycks_math/utils.py applies remove_boxed(last_boxed_only_string(...)) to the target but never to the model's response. Instruct models commonly output \boxed{} formatted answers, but the extraction only looks for $...$ delimiters in the response.

This causes correct answers to score 0. For example, when a model outputs "The domain is \boxed{[2,5)}", the extraction finds no $ signs and returns the entire string, so is_equiv fails against the target "[2,5)".
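
The failure mode can be illustrated with a simplified stand-in for the $...$ extraction described above. This is a sketch under assumptions, not the code in hendrycks_math/utils.py: `extract_dollar_span` is a hypothetical name, and plain string equality stands in for is_equiv.

```python
def extract_dollar_span(response: str) -> str:
    """Hypothetical helper: return the text between the first and last '$',
    or the whole response when fewer than two '$' characters are present."""
    indices = [i for i, ch in enumerate(response) if ch == "$"]
    if len(indices) <= 1:
        return response
    return response[indices[0] + 1 : indices[-1]]


response = "The domain is \\boxed{[2,5)}"
target = "[2,5)"

answer = extract_dollar_span(response)
# The response contains no '$', so the whole sentence is compared to the target.
is_correct = answer == target  # plain equality stands in for is_equiv (assumption)
```

Here `answer` is the full sentence rather than the boxed content, so the comparison fails even though the model's answer is right.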

Fix

  1. Single \boxed{} extraction: Try \boxed{} extraction on the response first, fall back to $...$. Aligns with lm_eval/tasks/aime/utils.py.
  2. Multiple \boxed{} extraction: Handle cases where models output separate boxed answers (e.g. \boxed{3}, \boxed{5}, \boxed{7} → 3, 5, 7). Duplicates are removed to handle models that repeat the final answer.
  3. Version bump: metadata.version bumped to 1.1 in YAML files.
  4. README: Added changelog entry for v1.1.
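
The extraction order in items 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `find_boxed_contents` and `extract_answer` are hypothetical names, and the ", " separator used when joining multiple boxed answers is an assumption based on the example above.

```python
from typing import List


def find_boxed_contents(text: str) -> List[str]:
    """Hypothetical helper: collect the contents of every top-level
    \\boxed{...} in text, matching nested braces."""
    marker = "\\boxed{"
    contents = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        i = start + len(marker)
        depth = 1
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop scanning
        contents.append(text[start + len(marker) : i - 1])
        pos = i
    return contents


def extract_answer(response: str) -> str:
    """Try \\boxed{} first, then fall back to the $...$ span."""
    boxed = find_boxed_contents(response)
    if boxed:
        # dict.fromkeys drops repeats while keeping first-seen order.
        return ", ".join(dict.fromkeys(boxed))  # separator is an assumption
    indices = [i for i, ch in enumerate(response) if ch == "$"]
    if len(indices) <= 1:
        return response  # nothing to extract: use the response verbatim
    return response[indices[0] + 1 : indices[-1]]
```

With this sketch, `extract_answer("The domain is \\boxed{[2,5)}")` yields the boxed content alone, and a response that repeats `\boxed{4}` twice collapses to a single `4`.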

Impact

Affects hendrycks_math and hendrycks_math500 tasks with any model that outputs \boxed{} formatted answers (standard for math instruct models).

Fixes #3643
Partially addresses #3652 (multi-boxed answers)

@NezLheimeur NezLheimeur requested a review from 0xSMT as a code owner March 20, 2026 14:56
Contributor

fxmarty-amd left a comment

LGTM, I confirm this fixes zero accuracy.

You should probably update the README and bump

metadata:
  version: 1.0

in the YAML files.

This fix looks similar to: #3192

@NezLheimeur
Author

@fxmarty-amd thanks for the review. I've updated the readme and associated yaml files.

process_results only applied remove_boxed/last_boxed_only_string to
the target (ground truth), never to the model's response. When instruct
models output answers in \boxed{} format, the answer extraction fell
back to $...$ matching or used the entire response verbatim.

This aligns hendrycks_math with the aime task which already extracts
\boxed{} from the response.

Fixes EleutherAI#3643

Add changelog entry for the \boxed{} extraction fix.

Handle cases where models output multiple \boxed{} answers
(e.g. \boxed{3}, \boxed{5}, \boxed{7}) by extracting all
occurrences and joining them. Duplicates are removed to handle
models that repeat the final answer.

Development

Successfully merging this pull request may close these issues.

hendrycks_math: process_results does not extract \boxed{} from model response
