Skip to content

Fix unescaped '.' in final_number_exact_match regex#4191

Open
Chessing234 wants to merge 1 commit intostanford-crfm:mainfrom
Chessing234:fix/final-number-regex-unescaped-dot
Open

Fix unescaped '.' in final_number_exact_match regex#4191
Chessing234 wants to merge 1 commit intostanford-crfm:mainfrom
Chessing234:fix/final-number-regex-unescaped-dot

Conversation

@Chessing234
Copy link
Copy Markdown
Contributor

Bug

final_number_exact_match in `src/helm/benchmark/metrics/evaluate_reference_metrics.py` extracts the "final number" from a generation with:

```python
matches = re.findall(r"-?[\d,]+(?:.\d+)?", x)
```

The `.` inside the optional non-capturing group is unescaped, so it matches any single character — not just a decimal point. Any single non-digit character sandwiched between digit runs gets glued into one "number", and the function returns a value that cannot possibly match a gold answer.

This metric is the main metric for GSM (`src/helm/benchmark/scenarios/gsm_scenario.py:86`, `main_metric="final_number_exact_match"`) and is exposed via `lite_run_specs.py:154`, so the bug affects grade-school-math evaluation results whenever a model response happens to produce digit-letter-digit patterns.

Worked example

`pred = "3,1415x926"`:

Regex re.findall result Returned "final number"
Current (.) [\"3,1415x926\"] \"31415x926\"
Fixed (\.) [\"3,1415\", \"926\"] \"926\"

With the current regex, . (any char) lets the non-capturing group swallow the x and the trailing digits, producing a single pseudo-number that's guaranteed not to equal any real gold. With the fix, the group only matches a literal period followed by digits (as the function's docstring and the GSM use-case clearly intend), so the two digit runs are separated and the final actual number is returned.

Other patterns affected by the same bug: anything with one stray non-digit between digit sequences, e.g. \"9a8\", \"1e6\" (collapses scientific notation mantissa into an unusable blob), \"3 4\" if the non-digit is a space, etc.

Fix

Escape the period so the optional decimal part only matches a literal .:

```diff

  •    matches = re.findall(r\"-?[\d,]+(?:.\d+)?\", x)
    
  •    matches = re.findall(r\"-?[\d,]+(?:\.\d+)?\", x)
    

```

All existing assertions in `test_final_number_exact_match` (tests around \"33\", \"-33\", \"33 eggs.\", \"\\\\boxed{33}\", \"34.2\", \"342.\", \"3,420\") continue to pass by hand-verification — none of them exercise the letter-between-digits case, so their expected outputs don't change. Only the previously-buggy digit-letter-digit inputs now return the correct final number.

`final_number_exact_match` uses

    re.findall(r"-?[\d,]+(?:.\d+)?", x)

to pull the "final number" out of a generation for the GSM scenario
and any run that selects the `final_number_exact_match` metric. The
`.` inside the non-capturing group is unescaped, so it matches *any*
single character, not just a decimal point. Any one-character
non-digit sandwich between digit runs is therefore glued into a
single "number".

Concretely, for `pred = "3,1415x926"`:

- Current regex captures `["3,1415x926"]`; after `replace(",", "")`
  the returned "final number" is `"31415x926"`, which never matches
  a real gold answer.
- Correct regex should yield `["3,1415", "926"]`, giving the final
  number as `"926"`.

The existing test cases in `test_final_number_exact_match` all pass
with the fix (verified by hand): none of them exercise the "letter
between digit runs" edge case, so none of the asserts change.

Escape the `.` so the optional decimal part only matches a literal
period followed by digits, as the docstring and the GSM usage
clearly intend.
Copy link
Copy Markdown
Collaborator

@yifanmai yifanmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This looks good in general, but please add a test case in test_final_number_exact_match() in test_evaluate_reference_metrics.py that addresses this specific bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants