Here is my script and the corresponding results:
```shell
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
```
[2025-02-18 06:01:32,413] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:299)
[2025-02-18 06:01:42,431] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:342)
|       Task       |Version|     Metric     |Value|   |Stderr|
|------------------|------:|----------------|----:|---|-----:|
|all               |       |extractive_match|0.766|±  | 0.019|
|custom:math_500:0 |      1|extractive_match|0.766|±  | 0.019|
```
This is about 5 points lower than the 81.6 reported in the README. I also went through issue #194, but I was still unable to reproduce that score.
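For what it's worth, here is the rough significance check I did on the gap (a minimal sketch; I am assuming a single pass over the 500 problems and treating accuracy as a binomial proportion):

```python
# Rough check: is 0.766 vs. 0.816 plausibly just sampling noise?
# Assumption: one pass over the 500 MATH-500 problems, binomial stderr.
import math

n = 500
observed = 0.766   # my run
reported = 0.816   # README / issue #194

se = math.sqrt(observed * (1 - observed) / n)  # matches the ~0.019 lighteval reports
z = (reported - observed) / se

print(f"stderr = {se:.3f}, z = {z:.2f}")
# stderr ≈ 0.019, z ≈ 2.6 -> larger than I would expect from pure sampling noise,
# though a single run of a sampled reasoning model can still swing a few points.
```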
My environment:
```
# CUDA 12.4
torch                   2.5.1
latex2sympy2_extended   1.0.6
vllm                    0.7.2
math-verify             0.5.2
lighteval               0.6.0.dev0
```
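For completeness, this is roughly how I collected the versions above (note that `torch.version.cuda` reports the CUDA toolkit PyTorch was built against, which may differ from the driver version; package names are taken from the list above):

```python
# Print the installed versions of the relevant packages (stdlib plus torch).
from importlib.metadata import version, PackageNotFoundError
import torch

print("CUDA (torch build):", torch.version.cuda)
for pkg in ["torch", "latex2sympy2_extended", "vllm", "math-verify", "lighteval"]:
    try:
        print(f"{pkg:25s} {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:25s} not installed")
```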
Are there any additional steps I can try to debug or improve my results? Any suggestions would be greatly appreciated.
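One step I am planning to try next is rerunning with `--save-details` and inspecting the per-sample output, to see whether the misses are answer-extraction failures or genuinely wrong generations. Sketch below; the glob pattern and column names are assumptions about lighteval's default output layout, so I would print `df.columns` first and adjust:

```python
# Hypothetical sketch (assumes a rerun with `--save-details`): load the per-sample
# parquet that lighteval writes under the output dir and eyeball the failing samples.
import glob
import pandas as pd

files = glob.glob(
    "data/evals/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/details/**/*math_500*.parquet",
    recursive=True,
)
df = pd.read_parquet(files[0])
print(df.columns.tolist())  # check the actual schema before filtering

# Column names below ("metrics", "predictions") are guesses; skip if they differ.
if {"metrics", "predictions"}.issubset(df.columns):
    misses = df[df["metrics"].apply(lambda m: m.get("extractive_match", 1) == 0)]
    print(f"{len(misses)} / {len(df)} samples scored 0")
    for _, row in misses.head(3).iterrows():
        print(row["predictions"], "\n" + "-" * 40)
```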