The reported OpenR1-Qwen-7B result on AIME24 is 36.7.
However, when I download the model from Hugging Face and evaluate it with lighteval, I get the results below:
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| all | | math_pass@1:32_samples | 0.4740 | ± | 0.0651 |
| | | extractive_match | 0.4667 | ± | 0.0926 |
| lighteval:aime24:0 | 1 | math_pass@1:32_samples | 0.4740 | ± | 0.0651 |
| | | extractive_match | 0.4667 | ± | 0.0926 |
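
For context, `math_pass@1:32_samples` is (as I understand it) the per-problem success rate averaged over the 32 sampled generations, which is why it comes with a nontrivial stderr. A minimal sketch of that averaging, assuming a hypothetical `results.tsv` with one `problem_id is_correct` row per sample:

```shell
# Average per-problem success over samples, then over problems (pass@1 estimate).
# results.tsv is hypothetical: column 1 = problem id, column 2 = 1/0 correctness.
awk '{ sum[$1] += $2; n[$1]++ }
     END { for (p in sum) { total += sum[p] / n[p]; k++ } print total / k }' results.tsv
```

With temperature 0.6 and top_p 0.95, this estimate can move a fair amount between runs on only 30 AIME problems, which may explain part of the gap.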
This is much higher than the reported value!
The evaluation code:
```shell
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```
I tried to use data_parallel_size, but encountered this issue.
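
In case it helps, the variant I tried looks roughly like this (a sketch, assuming `data_parallel_size` is simply swapped in for `tensor_parallel_size` in the model args; the exact failing command is not shown above):

```shell
# Same setup as above, but sharding across GPUs with data parallelism instead
# of tensor parallelism (assumed to be the change that triggers the issue).
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```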
For reference, I am using vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.
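
For anyone reproducing this, a quick way to confirm the same environment (a simple check, nothing lighteval-specific):

```shell
# List the installed versions of the relevant packages
pip list | grep -E "vllm|ray|lighteval"
```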
Has anyone ever faced this situation? Thanks in advance.
@lewtun Do you have any idea? Any comment would be helpful.