
[BUG] Is AIME24 broken? #771

Open

@eldarkurtic

Description

Hi folks,

I've been trying to reproduce the DeepSeek-R1-Distill-Llama-8B evals, but it seems like something is going wrong with AIME24 (MATH-500 and GPQA:Diamond look fine):

|          Task          |Version|        Metric        |Value |   |Stderr|
|------------------------|------:|----------------------|-----:|---|-----:|
|all                     |       |math_pass@1:1_samples |0.6763|±  |0.0534|
|                        |       |math_pass@1:4_samples |0.6782|±  |0.0388|
|                        |       |math_pass@1:8_samples |0.4375|±  |0.0636|
|                        |       |math_pass@1:16_samples|0.2188|±  |0.0318|
|                        |       |math_pass@1:32_samples|0.1094|±  |0.0159|
|                        |       |math_pass@1:64_samples|0.0547|±  |0.0080|
|                        |       |gpqa_pass@1:1_samples |0.5051|±  |0.0356|
|                        |       |gpqa_pass@1:4_samples |0.4710|±  |0.0273|
|                        |       |gpqa_pass@1:8_samples |0.4836|±  |0.0264|
|lighteval:aime24:0      |      2|math_pass@1:1_samples |0.4667|±  |0.0926|
|                        |       |math_pass@1:4_samples |0.4583|±  |0.0668|
|                        |       |math_pass@1:8_samples |0.4375|±  |0.0636|
|                        |       |math_pass@1:16_samples|0.2188|±  |0.0318|
|                        |       |math_pass@1:32_samples|0.1094|±  |0.0159|
|                        |       |math_pass@1:64_samples|0.0547|±  |0.0080|  <-- 0.0547 vs 0.439 (Open-R1 README)
|lighteval:gpqa:diamond:0|      1|gpqa_pass@1:1_samples |0.5051|±  |0.0356|
|                        |       |gpqa_pass@1:4_samples |0.4710|±  |0.0273|
|                        |       |gpqa_pass@1:8_samples |0.4836|±  |0.0264|
|lighteval:math_500:0    |      2|math_pass@1:1_samples |0.8860|±  |0.0142|
|                        |       |math_pass@1:4_samples |0.8980|±  |0.0108|            

i.e. I am seeing math_pass@1:64_samples = 0.0547 (5.47) on AIME24 versus the expected 43.9 (based on https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results). The AIME24 score also drops monotonically as the number of samples increases (0.4667 at 1 sample down to 0.0547 at 64 samples), which looks suspicious.
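For context, here is a minimal sketch (my own illustration, not lighteval's actual implementation) of how a pass@1:k_samples estimate is typically computed: the fraction of correct generations among the first k samples per problem, averaged across problems. The `correct` dict and `pass_at_1` helper below are hypothetical names:

```python
from statistics import mean

def pass_at_1(correct: dict[str, list[int]], k: int) -> float:
    """Estimate pass@1 from the first k samples per problem.

    `correct` maps each problem id to a list of 0/1 judgements,
    one per generated sample (hypothetical structure for illustration).
    """
    per_problem = []
    for samples in correct.values():
        assert len(samples) >= k, "need at least k samples per problem"
        # Fraction of the first k samples that were judged correct.
        per_problem.append(mean(samples[:k]))
    # Average over problems gives the benchmark-level pass@1 estimate.
    return mean(per_problem)
```

With a fixed generation setup, this estimate should stay roughly stable as k grows (only its variance shrinks), so a drop from ~0.47 at k=1 to ~0.05 at k=64 suggests the additional samples themselves are being scored as incorrect rather than an issue with the estimator.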

(Cross-posting this from Open-R1 huggingface/open-r1#655 since I am not sure who owns the reasoning evals.)
