Hi folks,
I've been trying to reproduce the DeepSeek-R1-Distill-Llama-8B evals, but it seems like something is going wrong with AIME24 (MATH-500 and GPQA:Diamond look fine):
| Task |Version| Metric |Value | |Stderr|
|------------------------|------:|----------------------|-----:|---|-----:|
|all | |math_pass@1:1_samples |0.6763|± |0.0534|
| | |math_pass@1:4_samples |0.6782|± |0.0388|
| | |math_pass@1:8_samples |0.4375|± |0.0636|
| | |math_pass@1:16_samples|0.2188|± |0.0318|
| | |math_pass@1:32_samples|0.1094|± |0.0159|
| | |math_pass@1:64_samples|0.0547|± |0.0080|
| | |gpqa_pass@1:1_samples |0.5051|± |0.0356|
| | |gpqa_pass@1:4_samples |0.4710|± |0.0273|
| | |gpqa_pass@1:8_samples |0.4836|± |0.0264|
|lighteval:aime24:0 | 2|math_pass@1:1_samples |0.4667|± |0.0926|
| | |math_pass@1:4_samples |0.4583|± |0.0668|
| | |math_pass@1:8_samples |0.4375|± |0.0636|
| | |math_pass@1:16_samples|0.2188|± |0.0318|
| | |math_pass@1:32_samples|0.1094|± |0.0159|
| | |math_pass@1:64_samples|0.0547|± |0.0080| <-- 0.0547 vs 0.439 (Open-R1 README)
|lighteval:gpqa:diamond:0| 1|gpqa_pass@1:1_samples |0.5051|± |0.0356|
| | |gpqa_pass@1:4_samples |0.4710|± |0.0273|
| | |gpqa_pass@1:8_samples |0.4836|± |0.0264|
|lighteval:math_500:0 | 2|math_pass@1:1_samples |0.8860|± |0.0142|
| | |math_pass@1:4_samples |0.8980|± |0.0108|
i.e. I am seeing math_pass@1:64_samples = 5.47 versus the expected 43.9 (per https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results).
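For what it's worth, the AIME24 score roughly halves every time the sample count doubles (0.4375 → 0.2188 → 0.1094 → 0.0547), which is the pattern you'd expect if only a fixed handful of completions per problem are being scored as correct and the additional samples come back empty or fail answer extraction. A minimal sketch of that arithmetic, assuming pass@1:n_samples is simply the mean correctness over the n samples (the function and the "4 correct completions" figure below are illustrative, not lighteval's actual code or my real per-problem counts):

```python
from statistics import mean

def pass_at_1(correct_flags: list[bool]) -> float:
    """pass@1 estimated from n samples: the mean correctness over those samples."""
    return mean(1.0 if ok else 0.0 for ok in correct_flags)

# Hypothetical illustration: suppose only the first 4 of n completions per
# problem are scored as correct and every extra sample is empty or unparsed.
# Each doubling of n then just dilutes the average and halves the score --
# the same halving pattern as the reported 0.4375 -> 0.2188 -> 0.1094 -> 0.0547.
for n in (8, 16, 32, 64):
    flags = [True] * 4 + [False] * (n - 4)  # 4 "correct" samples, rest scored 0
    print(n, pass_at_1(flags))              # 0.5, 0.25, 0.125, 0.0625
```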
(Cross-posting this from Open-R1, huggingface/open-r1#655, since I am not sure who owns the reasoning evals.)