### Description
Hi team! First things first, thank you for creating this wonderful benchmark!
I believe its curation and evaluation required a lot of effort, so I really appreciate that you open-sourced the datasets and evaluation scripts for the community.
### Summary of the issue
I have been trying to reproduce the leaderboard values by running the scripts locally, and the metrics I obtain locally are consistently worse than the reported values.
Although I understand that it is very hard to reproduce the exact leaderboard values, the difference is rather large: for `01-ai/Yi-1.5-9B-Chat`, the absolute difference in pass@1 is 6.3 points for the complete subset and 4.1 points for the instruct subset.
Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!
### Results
- Running the code evaluation script against pregenerated LLM outputs -> ✅ I can reproduce the leaderboard values (the command is sketched right after this list):

  | Subset | Leaderboard pass@1 | Local pass@1 |
  |---|---|---|
  | complete | 42.4 | 41.9 |
  | instruct | 34.5 | 34.4 |
- Running the generation and code evaluation scripts from scratch with the prebuilt Docker images on an A10 GPU -> ❌ I cannot reproduce the leaderboard values (full commands under "Steps to reproduce"):

  | Subset | Leaderboard pass@1 | Local pass@1 |
  |---|---|---|
  | complete | 42.4 | 36.1 (🔻6.3) |
  | instruct | 34.5 | 30.4 (🔻4.1) |
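
For reference, the pregenerated-outputs check in the first row used the same evaluate image. A rough sketch of the command, with the samples file left as a placeholder because it depends on the released artifact:

```bash
# Sketch of the pregenerated-outputs evaluation (complete subset shown).
# <PREGENERATED_SAMPLES>.jsonl is a placeholder for the released,
# already sanitized-and-calibrated samples file, not an actual file name.
docker run -m 16g -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --subset complete --samples <PREGENERATED_SAMPLES>.jsonl --parallel 32
```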
### Notes
- The number of timed-out problems is as follows and cannot account for the discrepancy (a rough way to count them is sketched after this list):

  | Subset | Leaderboard | Local evaluation |
  |---|---|---|
  | complete | 17 | 15 |
  | instruct | 17 | 15 |
- I increased the memory limit for the evaluation container (`-m 16g`) and did not see errors like `failed to map segment from shared object` during evaluation.
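
A quick way to count the timed-out problems from the evaluation output is sketched below; both the results file name and the `timeout` status string are assumptions, so they may need to be adjusted to the evaluator's actual output format:

```bash
# Rough proxy for the number of timed-out tasks: count occurrences of the
# string "timeout" in the evaluation results JSON. The file name and the
# status string are assumptions; adjust to the evaluator's actual output.
grep -o 'timeout' 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1-sanitized-calibrated_eval_results.json | wc -l
```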
### Steps to reproduce
I ran `01-ai/Yi-1.5-9B-Chat` on an A10 GPU to generate the LLM responses and then evaluated them, using the prebuilt Docker images for both steps.
The evaluation was run on 2024-07-08 at 14:00 for the complete subset and at 18:57 for the instruct subset.
The generation script (shown for the instruct subset):

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

# Greedy decoding (temperature 0, 1 sample, batch size 1) with the vLLM backend on a single GPU.
docker run --gpus '"device=0"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest --model "01-ai/Yi-1.5-9B-Chat" --subset "instruct" --greedy --bs "1" --temperature "0" --n_samples "1" --backend vllm --tp "1"
```
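
One quick sanity check on the generation step is to confirm that there is one generated sample per problem before evaluating (the JSONL file name matches the one passed to the sanitizer below):

```bash
# Each line of the generation output should correspond to one problem,
# so the line count should equal the number of problems in the subset.
wc -l 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1.jsonl
```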
The evaluation script (shown for the instruct subset):

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

# Sanitize and calibrate the raw generations.
docker run -it --entrypoint bigcodebench.sanitize -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1.jsonl --calibrate

# Evaluate the sanitized, calibrated samples with a 16 GB memory limit and 32 parallel workers.
docker run -m 16g -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --subset instruct --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1-sanitized-calibrated.jsonl --parallel 32
```
Docker images:

```
REPOSITORY                            TAG       IMAGE ID       CREATED      SIZE
bigcodebench/bigcodebench-generate    latest    eec1e77e88eb   6 days ago   24.6GB
bigcodebench/bigcodebench-evaluate    latest    6ff203339e91   6 days ago   5.44GB
```
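
If it helps with diagnosis, the exact builds can be identified by their digests using standard Docker commands (nothing BigCodeBench-specific):

```bash
# List the pinned digests of the two images used above.
docker images --digests bigcodebench/bigcodebench-generate
docker images --digests bigcodebench/bigcodebench-evaluate
```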