
Fix gsm8k task to enhance accuracy #2924

Open · wants to merge 1 commit into main

Conversation

zhangxinyuehfad commented Apr 21, 2025

When using the original GSM8K configuration, I found that the measured accuracy was low. As reported in issue #2707, this is fixed by optimizing the output and evaluation format, which significantly improves the accuracy.
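To illustrate the kind of output/evaluation format mismatch being fixed here, the sketch below is a minimal Python illustration. The two regexes only approximate the harness's strict-match and flexible-extract style filters; they are assumptions for illustration, not the exact patterns in the task YAML or in this PR's diff.

```python
import re

# Illustrative sketch only: approximations of a strict answer filter (expects the
# GSM8K-style "#### <number>" trailer) and a more permissive filter (takes the
# last number-like token). These are NOT the exact patterns shipped in the task.
STRICT_ANSWER_RE = re.compile(r"#### (-?[\d,]+)")
FLEXIBLE_ANSWER_RE = re.compile(r"-?[\d,]*\d(?:\.\d+)?")

def extract_strict(completion: str):
    m = STRICT_ANSWER_RE.search(completion)
    return m.group(1).replace(",", "") if m else None

def extract_flexible(completion: str):
    matches = FLEXIBLE_ANSWER_RE.findall(completion)
    return matches[-1].replace(",", "") if matches else None

# A completion that reasons correctly but ends with "The answer is 72." instead
# of "#### 72" is scored wrong by the strict filter even though it is right.
completion = (
    "Natalia sold 48 clips in April and 24 in May, so 48 + 24 = 72. "
    "The answer is 72."
)
print(extract_strict(completion))    # None -> counted as incorrect
print(extract_flexible(completion))  # '72' -> counted as correct
```

If a model tends to produce the conversational ending rather than the "#### <number>" trailer (as reported later in this thread for Qwen2.5), the strict score drops even when the arithmetic is correct.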

The results are as follows:
before: [screenshot: accuracy with the original configuration]
after: [screenshot: accuracy with this PR's configuration]

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
CLAassistant commented Apr 21, 2025

CLA assistant check
All committers have signed the CLA.

zhangxinyuehfad changed the title from "Update gsm8k task" to "Fix gsm8k task to enhance accuracy" on Apr 22, 2025


leo-pony commented May 7, 2025

@baberabb Could you please help review this PR? I also ran into the low-accuracy problem caused by the difference between the output format and the judgment format in the gsm8k test above.

@StellaAthena Could you please help review this PR?

baberabb (Contributor) commented May 8, 2025

Hi @leo-pony. Thanks for the PR. Could you reference where you sourced this prompt from? Our library aims to provide standardized metrics for direct model-to-model comparisons, which is why we're careful about prompt modifications. Some details can be found in the readme.
Have you tried evaluating on gsm8k_llama, which is based on the llama evals? It's very similar to your modification, so I think results might be similar.
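For anyone who wants to run the comparison suggested above, here is a rough sketch using the harness's Python API. The model name, task list, and keyword arguments are assumptions to adapt to your own setup (check the installed version's documentation or task list for the exact names), not something defined by this PR.

```python
# Rough sketch, assuming a recent lm-evaluation-harness exposing lm_eval.simple_evaluate.
# The model and argument choices below are placeholders for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # hypothetical model under test
    tasks=["gsm8k", "gsm8k_llama"],  # default task vs. the llama-evals variant
    apply_chat_template=True,        # instruct-tuned models usually need this
    limit=200,                       # small subset for a quick sanity check
)

# Print the per-task metrics (e.g. exact_match under each filter).
for task, metrics in results["results"].items():
    print(task, metrics)
```

Comparing the scores of the two tasks, and gsm8k's strict vs. flexible filters, should show whether the gap is mainly an answer-format issue.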


leo-pony commented May 9, 2025

@baberabb Thanks for your review!
We applied the answer-formatting prompt from this PR to Llama-3.1-8B-Instruct and found that it had no effect on that model's results. It seems that Qwen2.5 does not follow the lm-eval community's current answer-formatting prompt well, but follows the formatting prompt in this PR better.
The detailed test results for Llama-3.1-8B-Instruct with this PR's prompt are as follows:
[screenshot: Llama-3.1-8B-Instruct results with this PR's prompt]
The detailed test results for Llama-3.1-8B-Instruct with the community prompt are as follows:
[screenshot: Llama-3.1-8B-Instruct results with the community prompt]
