
OpenR1-Qwen-7B achieves 47.40 on AIME24, better than reported! #622

Open
@Hasuer


The reported AIME24 result for OpenR1-Qwen-7B is 36.7.

However, when I downloaded the model from Hugging Face and evaluated it with lighteval, I got the results below:

Task                 Version  Metric                  Value   Stderr
all                           math_pass@1:32_samples  0.4740  ± 0.0651
                              extractive_match        0.4667  ± 0.0926
lighteval:aime24:0   1        math_pass@1:32_samples  0.4740  ± 0.0651
                              extractive_match        0.4667  ± 0.0926

This is much higher than the reported score!
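For context on the math_pass@1:32_samples metric: pass@1 over 32 samples is typically computed with the standard unbiased pass@k estimator, then averaged across problems. This is a minimal sketch of that estimator (assuming lighteval follows the standard formulation; the function name and example counts are illustrative, not taken from the lighteval source):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n = total samples drawn, c = correct samples, k = budget.
    Returns the probability that at least one of k samples
    (drawn without replacement from the n) is correct."""
    if n - c < k:
        # fewer incorrect samples than k: some draw is always correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the fraction of correct samples, c / n.
# Hypothetical example: 15 of 32 samples correct on one problem.
print(pass_at_k(32, 15, 1))  # 15/32 = 0.46875
```

The benchmark score (0.4740 here) would then be the mean of this per-problem value over all AIME24 problems.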

The evaluation code:

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

I also tried setting data_parallel_size, but ran into this issue.

For reference, I am using vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

Has anyone else seen this? Thanks in advance.

@lewtun Do you have any idea? Any comments would be helpful.
