The reported OpenR1-Qwen-7B result on AIME24 is 36.7.
However, when I download the model from Hugging Face and evaluate it with lighteval, I get the results below:
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| all | | math_pass@1:32_samples | 0.4740 | ± | 0.0651 |
| | | extractive_match | 0.4667 | ± | 0.0926 |
| lighteval:aime24:0 | 1 | math_pass@1:32_samples | 0.4740 | ± | 0.0651 |
| | | extractive_match | 0.4667 | ± | 0.0926 |
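
For context, `math_pass@1:32_samples` is (as I understand it) the per-problem success rate averaged over the 32 sampled generations, which is why it comes with a nontrivial stderr. A minimal sketch of that averaging, assuming a hypothetical `results.tsv` with one `problem_id is_correct` row per sample:

```shell
# Average per-problem success over samples, then over problems (pass@1 estimate).
# results.tsv is hypothetical: column 1 = problem id, column 2 = 1/0 correctness.
awk '{ sum[$1] += $2; n[$1]++ }
     END { for (p in sum) { total += sum[p] / n[p]; k++ } print total / k }' results.tsv
```

With temperature 0.6 and top_p 0.95, this estimate can move a fair amount between runs on only 30 AIME problems, which may explain part of the gap.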
This is much higher than the reported value!
The evaluation code:
```shell
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```
I tried to use data_parallel_size, but encountered this issue.
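
In case it helps, the variant I tried looks roughly like this (a sketch, assuming `data_parallel_size` is simply swapped in for `tensor_parallel_size` in the model args; the exact failing command is not shown above):

```shell
# Same setup as above, but sharding across GPUs with data parallelism instead
# of tensor parallelism (assumed to be the change that triggers the issue).
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```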
For reference, I am using vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.
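
For anyone reproducing this, a quick way to confirm the same environment (a simple check, nothing lighteval-specific):

```shell
# List the installed versions of the relevant packages
pip list | grep -E "vllm|ray|lighteval"
```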
Has anyone ever faced this situation? Thanks in advance.
@lewtun Do you have any idea? Any comment would be helpful.