
Fix gsm8k task to enhance accuracy #2924

Open · wants to merge 1 commit into main

Conversation

zhangxinyuehfad commented Apr 21, 2025

When using the original GSM8K configuration, I found that the measured accuracy was low. As reported in issue #2707, this is fixed by optimizing the output and evaluation format, which significantly improves the accuracy.
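To illustrate the kind of output/evaluation format mismatch being fixed here, the sketch below is a minimal Python illustration. The two regexes only approximate the harness's strict-match and flexible-extract style filters; they are assumptions for illustration, not the exact patterns in the task YAML or in this PR's diff.

```python
import re

# Illustrative sketch only: approximations of a strict answer filter (expects the
# GSM8K-style "#### <number>" trailer) and a more permissive filter (takes the
# last number-like token). These are NOT the exact patterns shipped in the task.
STRICT_ANSWER_RE = re.compile(r"#### (-?[\d,]+)")
FLEXIBLE_ANSWER_RE = re.compile(r"-?[\d,]*\d(?:\.\d+)?")

def extract_strict(completion: str):
    m = STRICT_ANSWER_RE.search(completion)
    return m.group(1).replace(",", "") if m else None

def extract_flexible(completion: str):
    matches = FLEXIBLE_ANSWER_RE.findall(completion)
    return matches[-1].replace(",", "") if matches else None

# A completion that reasons correctly but ends with "The answer is 72." instead
# of "#### 72" is scored wrong by the strict filter even though it is right.
completion = (
    "Natalia sold 48 clips in April and 24 in May, so 48 + 24 = 72. "
    "The answer is 72."
)
print(extract_strict(completion))    # None -> counted as incorrect
print(extract_flexible(completion))  # '72' -> counted as correct
```

If a model tends to produce the conversational ending rather than the "#### <number>" trailer (as reported later in this thread for Qwen2.5), the strict score drops even when the arithmetic is correct.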

The results are as follows:
before: [screenshot: accuracy with the original configuration]
after: [screenshot: accuracy with this PR's configuration]

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
CLAassistant commented Apr 21, 2025

CLA assistant check
All committers have signed the CLA.

zhangxinyuehfad changed the title from "Update gsm8k task" to "Fix gsm8k task to enhance accuracy" on Apr 22, 2025


leo-pony commented May 7, 2025

@baberabb Could you please help review this PR? I also ran into the low-accuracy problem caused by the difference between the output format and the judgment format in the gsm8k test above.

@StellaAthena Could you please help review this PR?

baberabb (Contributor) commented May 8, 2025

Hi @leo-pony. Thanks for the PR. Could you reference where you sourced this prompt from? Our library aims to provide standardized metrics for direct model-to-model comparisons, which is why we're careful about prompt modifications. Some details can be found in the readme.
Have you tried evaluating on gsm8k_llama, which is based on the llama evals? It's very similar to your modification, so I think results might be similar.
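For anyone who wants to run the comparison suggested above, here is a rough sketch using the harness's Python API. The model name, task list, and keyword arguments are assumptions to adapt to your own setup (check the installed version's documentation or task list for the exact names), not something defined by this PR.

```python
# Rough sketch, assuming a recent lm-evaluation-harness exposing lm_eval.simple_evaluate.
# The model and argument choices below are placeholders for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # hypothetical model under test
    tasks=["gsm8k", "gsm8k_llama"],  # default task vs. the llama-evals variant
    apply_chat_template=True,        # instruct-tuned models usually need this
    limit=200,                       # small subset for a quick sanity check
)

# Print the per-task metrics (e.g. exact_match under each filter).
for task, metrics in results["results"].items():
    print(task, metrics)
```

Comparing the scores of the two tasks, and gsm8k's strict vs. flexible filters, should show whether the gap is mainly an answer-format issue.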


leo-pony commented May 9, 2025

@baberabb Thanks for your review!
We applied the answer-formatting prompt from this PR to Llama-3.1-8B-Instruct and found that it had no effect on that model's results. It seems that Qwen2.5 does not follow the lm-eval community's current answer-formatting prompt well, but follows the formatting prompt in this PR better.
The detailed test results for Llama-3.1-8B-Instruct with this PR's prompt are as follows:
[screenshot: Llama-3.1-8B-Instruct results with this PR's prompt]
The detailed test results for Llama-3.1-8B-Instruct with the community prompt are as follows:
[screenshot: Llama-3.1-8B-Instruct results with the community prompt]
