fix: extract \boxed{} from model response in hendrycks_math by NezLheimeur · Pull Request #3644 · EleutherAI/lm-evaluation-harness

NezLheimeur · 2026-03-20T14:56:53Z

Problem

process_results in hendrycks_math/utils.py applies remove_boxed(last_boxed_only_string(...)) to the target but never to the model's response. Instruct models commonly output \boxed{} formatted answers, but the extraction only looks for $...$ delimiters in the response.

This causes correct answers to score 0. For example, when a model outputs "The domain is \boxed{[2,5)}", no $ delimiters are found, so the extraction returns the entire string, and is_equiv then compares that full sentence against "[2,5)" and fails.
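The failure mode described above can be reproduced with a small stand-alone sketch. The helper `extract_between_dollars` is a hypothetical stand-in for the $...$ extraction step, not the harness code, and plain string equality is assumed in place of is_equiv.

```python
def extract_between_dollars(response: str) -> str:
    """Hypothetical stand-in for the $...$ extraction step.

    Returns the text between the first and last '$'. If fewer than two
    '$' characters are present, the whole response is returned unchanged.
    """
    positions = [i for i, ch in enumerate(response) if ch == "$"]
    if len(positions) <= 1:
        return response
    return response[positions[0] + 1 : positions[-1]]


response = "The domain is \\boxed{[2,5)}"
target = "[2,5)"

extracted = extract_between_dollars(response)
# No '$' in the response, so the full sentence is compared to the target.
matches = extracted == target  # assumption: equality stands in for is_equiv
print(repr(extracted), matches)
```

Under these assumptions the extracted string is the whole sentence, so the comparison fails even though the boxed answer is correct.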

Fix

Single \boxed{} extraction: Try \boxed{} extraction on the response first, then fall back to $...$ matching. This aligns with lm_eval/tasks/aime/utils.py.
Multiple \boxed{} extraction: Handle cases where models output separate boxed answers (e.g. \boxed{3}, \boxed{5}, \boxed{7} → 3, 5, 7). Duplicates are removed to handle models that repeat the final answer.
Version bump: metadata.version bumped to 1.1 in YAML files.
README: Added changelog entry for v1.1.
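The extraction order listed above (boxed first, duplicates removed, then the $...$ fallback) could be sketched roughly as follows. This is not the code from the PR: `find_boxed_contents` and `extract_answer` are hypothetical names, and the ", " separator and order-preserving deduplication are assumptions based on the "3, 5, 7" example.

```python
def find_boxed_contents(text: str) -> list[str]:
    """Return the contents of every top-level \\boxed{...} in text.

    Braces are matched by depth so nested groups such as
    \\boxed{\\frac{1}{2}} are captured whole.
    """
    marker = "\\boxed{"
    contents = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        begin = start + len(marker)
        i = begin
        depth = 1
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: ignore the incomplete box
        contents.append(text[begin : i - 1])
        pos = i
    return contents


def extract_answer(response: str) -> str:
    """Hypothetical extraction: boxed answers first, then $...$, then raw text."""
    boxed = find_boxed_contents(response)
    if boxed:
        # dict.fromkeys drops repeats while keeping first-seen order.
        unique = list(dict.fromkeys(boxed))
        return ", ".join(unique)  # assumption: answers joined with ", "
    positions = [i for i, ch in enumerate(response) if ch == "$"]
    if len(positions) > 1:
        return response[positions[0] + 1 : positions[-1]]
    return response


print(extract_answer("The domain is \\boxed{[2,5)}"))
print(extract_answer("\\boxed{3}, \\boxed{5}, \\boxed{7}"))
print(extract_answer("So \\boxed{4}. Final answer: \\boxed{4}"))
```

Depth-based brace matching is used here instead of a regular expression because boxed answers often contain nested braces, which a simple non-greedy pattern would cut short.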

Impact

Affects hendrycks_math and hendrycks_math500 tasks with any model that outputs \boxed{} formatted answers (standard for math instruct models).

Fixes #3643
Partially addresses #3652 (multi-boxed answers)

fxmarty-amd

LGTM, I confirm this fixes zero accuracy.

You should probably update the README and

metadata:
  version: 1.0

in yaml files

This fix looks similar to: #3192

NezLheimeur · 2026-03-25T16:34:57Z

@fxmarty-amd thanks for the review. I've updated the readme and associated yaml files.

process_results only applied remove_boxed/last_boxed_only_string to the target (ground truth), never to the model's response. When instruct models output answers in \boxed{} format, the answer extraction fell back to $...$ matching or used the entire response verbatim. This aligns hendrycks_math with the aime task which already extracts \boxed{} from the response.

Fixes EleutherAI#3643


NezLheimeur requested a review from 0xSMT as a code owner March 20, 2026 14:56

fxmarty-amd approved these changes Mar 25, 2026


fxmarty-amd mentioned this pull request Mar 25, 2026

Wrong Hendrycks filter with base x_y answers or targets, or permutation of several expected numbers x,y,z #3652

Open

NezLheimeur mentioned this pull request Mar 26, 2026

feat: add optional SymPy equivalence and math_verify to hendrycks_math #3655

Open

NezLheimeur added 3 commits March 26, 2026 18:00

Bump hendrycks_math version to 1.1 and update README

07da86e

Add changelog entry for the \boxed{} extraction fix.

Add multi-boxed answer extraction and deduplication

d6cae43

Handle cases where models output multiple \boxed{} answers (e.g. \boxed{3}, \boxed{5}, \boxed{7}) by extracting all occurrences and joining them. Duplicates are removed to handle models that repeat the final answer.

NezLheimeur force-pushed the fix/hendrycks-math-boxed-extraction branch from 92deb43 to d6cae43 Compare March 26, 2026 17:00

NezLheimeur mentioned this pull request Mar 26, 2026

hendrycks_math: process_results does not extract \boxed{} from model response #3643

Open

NezLheimeur wants to merge 3 commits into EleutherAI:main from NezLheimeur:fix/hendrycks-math-boxed-extraction
