Here is my script and the corresponding results:
```shell
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
```
[2025-02-18 06:01:32,413] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:299)
[2025-02-18 06:01:42,431] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:342)
|       Task       |Version|     Metric     |Value|   |Stderr|
|------------------|------:|----------------|----:|---|-----:|
|all               |       |extractive_match|0.766|±  | 0.019|
|custom:math_500:0 |      1|extractive_match|0.766|±  | 0.019|
```
This is about 5 points lower than the 81.6 reported in the README. I also went through issue #194, but I was still unable to reproduce that score.
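For what it's worth, here is the rough significance check I did on the gap (a minimal sketch; I am assuming a single pass over the 500 problems and treating accuracy as a binomial proportion):

```python
# Rough check: is 0.766 vs. 0.816 plausibly just sampling noise?
# Assumption: one pass over the 500 MATH-500 problems, binomial stderr.
import math

n = 500
observed = 0.766   # my run
reported = 0.816   # README / issue #194

se = math.sqrt(observed * (1 - observed) / n)  # matches the ~0.019 lighteval reports
z = (reported - observed) / se

print(f"stderr = {se:.3f}, z = {z:.2f}")
# stderr ≈ 0.019, z ≈ 2.6 -> larger than I would expect from pure sampling noise,
# though a single run of a sampled reasoning model can still swing a few points.
```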
My environment:
```
# CUDA 12.4
torch                   2.5.1
latex2sympy2_extended   1.0.6
vllm                    0.7.2
math-verify             0.5.2
lighteval               0.6.0.dev0
```
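For completeness, this is roughly how I collected the versions above (note that `torch.version.cuda` reports the CUDA toolkit PyTorch was built against, which may differ from the driver version; package names are taken from the list above):

```python
# Print the installed versions of the relevant packages (stdlib plus torch).
from importlib.metadata import version, PackageNotFoundError
import torch

print("CUDA (torch build):", torch.version.cuda)
for pkg in ["torch", "latex2sympy2_extended", "vllm", "math-verify", "lighteval"]:
    try:
        print(f"{pkg:25s} {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:25s} not installed")
```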
Are there any additional steps I can try to debug or improve my results? Any suggestions would be greatly appreciated.
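One step I am planning to try next is rerunning with `--save-details` and inspecting the per-sample output, to see whether the misses are answer-extraction failures or genuinely wrong generations. Sketch below; the glob pattern and column names are assumptions about lighteval's default output layout, so I would print `df.columns` first and adjust:

```python
# Hypothetical sketch (assumes a rerun with `--save-details`): load the per-sample
# parquet that lighteval writes under the output dir and eyeball the failing samples.
import glob
import pandas as pd

files = glob.glob(
    "data/evals/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/details/**/*math_500*.parquet",
    recursive=True,
)
df = pd.read_parquet(files[0])
print(df.columns.tolist())  # check the actual schema before filtering

# Column names below ("metrics", "predictions") are guesses; skip if they differ.
if {"metrics", "predictions"}.issubset(df.columns):
    misses = df[df["metrics"].apply(lambda m: m.get("extractive_match", 1) == 0)]
    print(f"{len(misses)} / {len(df)} samples scored 0")
    for _, row in misses.head(3).iterrows():
        print(row["predictions"], "\n" + "-" * 40)
```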