Fix unescaped '.' in final_number_exact_match regex#4191
Open
Chessing234 wants to merge 1 commit intostanford-crfm:mainfrom
Open
Fix unescaped '.' in final_number_exact_match regex#4191Chessing234 wants to merge 1 commit intostanford-crfm:mainfrom
Chessing234 wants to merge 1 commit intostanford-crfm:mainfrom
Conversation
`final_number_exact_match` uses
re.findall(r"-?[\d,]+(?:.\d+)?", x)
to pull the "final number" out of a generation for the GSM scenario
and any run that selects the `final_number_exact_match` metric. The
`.` inside the non-capturing group is unescaped, so it matches *any*
single character, not just a decimal point. Any one-character
non-digit sandwich between digit runs is therefore glued into a
single "number".
Concretely, for `pred = "3,1415x926"`:
- Current regex captures `["3,1415x926"]`; after `replace(",", "")`
the returned "final number" is `"31415x926"`, which never matches
a real gold answer.
- Correct regex should yield `["3,1415", "926"]`, giving the final
number as `"926"`.
The existing test cases in `test_final_number_exact_match` all pass
with the fix (verified by hand): none of them exercise the "letter
between digit runs" edge case, so none of the asserts change.
Escape the `.` so the optional decimal part only matches a literal
period followed by digits, as the docstring and the GSM usage
clearly intend.
yifanmai
requested changes
Apr 16, 2026
Collaborator
yifanmai
left a comment
There was a problem hiding this comment.
Thanks. This looks good in general, but please add a test case in test_final_number_exact_match() in test_evaluate_reference_metrics.py that addresses this specific bug.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
final_number_exact_matchin `src/helm/benchmark/metrics/evaluate_reference_metrics.py` extracts the "final number" from a generation with:```python
matches = re.findall(r"-?[\d,]+(?:.\d+)?", x)
```
The `.` inside the optional non-capturing group is unescaped, so it matches any single character — not just a decimal point. Any single non-digit character sandwiched between digit runs gets glued into one "number", and the function returns a value that cannot possibly match a gold answer.
This metric is the main metric for GSM (`src/helm/benchmark/scenarios/gsm_scenario.py:86`, `main_metric="final_number_exact_match"`) and is exposed via `lite_run_specs.py:154`, so the bug affects grade-school-math evaluation results whenever a model response happens to produce digit-letter-digit patterns.
Worked example
`pred = "3,1415x926"`:
re.findallresult.)[\"3,1415x926\"]\"31415x926\"\.)[\"3,1415\", \"926\"]\"926\"With the current regex,
.(any char) lets the non-capturing group swallow thexand the trailing digits, producing a single pseudo-number that's guaranteed not to equal any real gold. With the fix, the group only matches a literal period followed by digits (as the function's docstring and the GSM use-case clearly intend), so the two digit runs are separated and the final actual number is returned.Other patterns affected by the same bug: anything with one stray non-digit between digit sequences, e.g.
\"9a8\",\"1e6\"(collapses scientific notation mantissa into an unusable blob),\"3 4\"if the non-digit is a space, etc.Fix
Escape the period so the optional decimal part only matches a literal
.:```diff
```
All existing assertions in `test_final_number_exact_match` (tests around
\"33\",\"-33\",\"33 eggs.\",\"\\\\boxed{33}\",\"34.2\",\"342.\",\"3,420\") continue to pass by hand-verification — none of them exercise the letter-between-digits case, so their expected outputs don't change. Only the previously-buggy digit-letter-digit inputs now return the correct final number.