Failed to reproduce MATH-500 score on DeepSeek-R1-Distill-Qwen-1.5B #354

Closed
@superdocker

Description

Here is my script and the corresponding results:

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL

NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
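
(As a side check, independent of lighteval, it may be worth confirming that the chat template --use-chat-template relies on is actually shipped with this tokenizer; a minimal sketch, assuming transformers is installed and not part of the run above:)

# Hypothetical sanity check: confirm the tokenizer ships a chat template
# and apply it to a dummy prompt to see the exact formatting used.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print(tok.chat_template is not None)  # expected: True for the distilled models
print(tok.apply_chat_template(
    [{"role": "user", "content": "What is 1 + 1?"}],
    tokenize=False,
    add_generation_prompt=True,
))
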
[2025-02-18 06:01:32,413] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:299)
[2025-02-18 06:01:42,431] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:342)
|      Task       |Version|     Metric     |Value|   |Stderr|
|-----------------|------:|----------------|----:|---|-----:|
|all              |       |extractive_match|0.766|±  | 0.019|
|custom:math_500:0|      1|extractive_match|0.766|±  | 0.019|

This score (76.6) is roughly 5 points lower than the 81.6 reported in the README.

I also referred to issue #194 but was unable to reproduce the reported score of 81.6.
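
For what it's worth, a back-of-the-envelope check (using the stderr from the table above and the 81.6 figure from the README / issue #194) suggests the gap is somewhat larger than plain sampling noise over 500 problems:

# Back-of-the-envelope check: is a 5-point gap plausible as sampling noise
# over the 500 MATH-500 problems? Numbers taken from the table above and
# the 81.6 figure mentioned in the README / issue #194.
import math

observed = 0.766
reported = 0.816
n = 500

stderr = math.sqrt(observed * (1 - observed) / n)  # binomial standard error
print(round(stderr, 3))                            # ~0.019, matching the table
print(round((reported - observed) / stderr, 1))    # gap of ~2.6 standard errors
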
My environment:

# CUDA 12.4
torch                             2.5.1
latex2sympy2_extended             1.0.6
vllm                              0.7.2 
math-verify                       0.5.2
lighteval                         0.6.0.dev0
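
If the difference comes from answer extraction rather than generation, scoring a few gold/prediction pairs directly with math-verify might help isolate it; a minimal sketch using its parse/verify API (the example strings below are made up):

# Minimal sketch: score a single (gold, prediction) pair with math-verify,
# outside of lighteval. The strings here are hypothetical examples.
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")
pred = parse("The final answer is $0.5$.")
print(verify(gold, pred))  # True if math-verify treats them as equivalent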

Are there any additional steps I can try to debug or improve my results? Any suggestions would be greatly appreciated.
