Evaluating math ability with pass@1: how do I generate multiple results per question for evaluation? #2250
DELEnomore asked this question in Q&A (unanswered)
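I want to reproduce DeepSeek-R1's evaluation results. After I added `num_return_sequences=2` to `generation_kwargs`, the scores on both AIME2025 and MATH500 dropped to 0. Without this parameter, the evaluation produces normal results.

I noticed that the Docs > Reasoning Model Evaluation Tutorial (强推理模型评测教程) chapter averages the final results. Its computation path is different, but the result seems to be mathematically the same. Is that the only way to do it? Is there a way to first average the multiple answers to each question and then average over all questions? Also, I need to use MathEvaluator, i.e. rule-based scoring rather than an LLM judge.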
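The two averaging orders in the question can be compared with a small sketch. Everything in it is hypothetical: `normalize` and `is_correct` are a toy rule-based checker (string and fraction matching only), not MathEvaluator's actual rules, and the function names are made up for illustration. With k samples for each of N questions and correctness c_ij, averaging per question first gives (1/N)·Σ_i (1/k)·Σ_j c_ij, while averaging per run gives (1/k)·Σ_j (1/N)·Σ_i c_ij. Both are the same double sum divided by N·k, so they match whenever every question has exactly k samples.

```python
from fractions import Fraction


def normalize(ans: str) -> str:
    """Toy rule-based normalizer (hypothetical, NOT MathEvaluator's logic)."""
    ans = ans.strip()
    # Strip a surrounding \boxed{...} if present.
    if ans.startswith("\\boxed{") and ans.endswith("}"):
        ans = ans[len("\\boxed{"):-1]
    ans = ans.replace(" ", "").rstrip(".")
    try:
        # Canonicalize anything that parses as a number or a/b fraction.
        return str(Fraction(ans))
    except (ValueError, ZeroDivisionError):
        # Otherwise fall back to plain string comparison.
        return ans


def is_correct(pred: str, ref: str) -> bool:
    return normalize(pred) == normalize(ref)


def _avg(xs):
    # Exact average (Fraction) so equal results compare exactly equal.
    xs = list(xs)
    return Fraction(sum(xs), len(xs))


def mean_of_question_means(preds, refs):
    """Average the k answers of each question, then average over questions."""
    per_question = [_avg(is_correct(p, r) for p in ps) for ps, r in zip(preds, refs)]
    return _avg(per_question)


def mean_of_run_accuracies(preds, refs):
    """Treat the j-th answer of every question as run j; average run accuracies."""
    k = len(preds[0])
    if any(len(ps) != k for ps in preds):
        raise ValueError("every question needs the same number of samples")
    per_run = [_avg(is_correct(ps[j], r) for ps, r in zip(preds, refs)) for j in range(k)]
    return _avg(per_run)


if __name__ == "__main__":
    refs = ["42", "1/2", "\\frac{3}{4}"]
    # preds[i] holds the k = 2 sampled answers for question i.
    preds = [["42", "41"], ["0.5", "\\boxed{1/2}"], ["\\frac{3}{4}", "\\frac{3}{4}"]]
    print(mean_of_question_means(preds, refs), mean_of_run_accuracies(preds, refs))
    # prints: 5/6 5/6
```

The two orders only diverge when questions end up with different numbers of samples (for example, if some generations are dropped). In that case the per-run view no longer lines up, and averaging per question first is the one that still makes sense.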