(Tokenization) Performance Degradation Starting from Transformers v4.42.* #21

Closed
@takkyu2

Description

Hi team! First things first, thank you for creating this wonderful benchmark!
I believe its curation and evaluation required a lot of effort, so I really appreciate that you open-sourced the datasets and evaluation scripts for the community.

Summary of the issue

I have been trying to reproduce the leaderboard values by running the scripts locally, and I found that the metrics evaluated locally are consistently worse than the leaderboard values.

Although I understand that it is very hard to reproduce the exact leaderboard values, the difference is rather large: for 01-ai/Yi-1.5-9B-Chat, the absolute pass@1 difference is 6.3 on the complete subset and 4.1 on the instruct subset.

Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!

Results

  • Evaluate the pre-generated LLM outputs with the prebuilt docker image -> ✅ I can roughly reproduce the leaderboard values:

    subset     Leaderboard   local evaluation
    complete   42.4          41.9
    instruct   34.5          34.4

  • Run generation and code evaluation scripts from scratch with prebuilt docker images on an A10 GPU -> ❌ I cannot reproduce the leaderboard values:

    subset     Leaderboard   local evaluation
    complete   42.4          36.1 (🔻6.3)
    instruct   34.5          30.4 (🔻4.1)

Notes

  • The number of problems that timed out is as follows, and it cannot account for the discrepancy (counted with the jq sketch after this list):

    subset     Leaderboard   local evaluation
    complete   17            15
    instruct   17            15
  • I increased the memory limit and didn't get errors like "failed to map segment from shared object" during evaluation
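
For reference, the timeout counts above can be pulled from the saved evaluation results with a small jq query; the *_eval_results.json file name, the "eval"/"status" layout, and the "timeout" status value are all assumptions about the evaluator's output schema, so the path may need adjusting:

# Count tasks whose recorded status is "timeout" (schema assumed)
jq '[.eval[][] | select(.status == "timeout")] | length' *_eval_results.json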

Steps to reproduce

I ran 01-ai/Yi-1.5-9B-Chat on an A10 GPU to generate the LLM responses and then evaluated them, using docker images for both steps.
The evaluations were run on 2024-07-08, at 14:00 for complete and at 18:57 for instruct.

The generation script:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
# Greedy, single-sample generation with the prebuilt vLLM image
docker run --gpus '"device=0"' -v "$(pwd)":/app -t bigcodebench/bigcodebench-generate:latest --model "01-ai/Yi-1.5-9B-Chat" --subset "instruct" --greedy --bs "1" --temperature "0" --n_samples "1" --backend vllm --tp "1"
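
Since the title points at Transformers v4.42.*, one experiment worth running is downgrading transformers inside the generation container before generating. The override below is only a sketch: it assumes the image exposes bash and pip and that generation is driven by a bigcodebench.generate module (both assumptions on my side):

#!/bin/bash

# Sketch: regenerate with transformers pinned below the suspect 4.42 line
docker run --gpus '"device=0"' -v "$(pwd)":/app --entrypoint bash -t \
  bigcodebench/bigcodebench-generate:latest -c '
    pip install "transformers<4.42" && \
    python -m bigcodebench.generate \
      --model "01-ai/Yi-1.5-9B-Chat" --subset "instruct" \
      --greedy --bs 1 --temperature 0 --n_samples 1 --backend vllm --tp 1'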

The evaluation script:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
# Sanitize (and calibrate) the raw generations first
docker run -it --entrypoint bigcodebench.sanitize -v "$(pwd)":/app:rw bigcodebench/bigcodebench-evaluate:latest --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1.jsonl --calibrate
# Then evaluate the sanitized samples with a 16 GB memory limit
docker run -m 16g -v "$(pwd)":/app:rw bigcodebench/bigcodebench-evaluate:latest --subset instruct --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1-sanitized-calibrated.jsonl --parallel 32
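
To compare runs, pass@1 can also be read back from the saved results instead of the console output; the *_eval_results.json file name and the top-level "pass@1" key are assumptions about the evaluator's output, so adjust as needed:

# Read pass@1 back out of the saved results (file name and key assumed)
jq '.["pass@1"]' *_eval_results.json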

Docker images:

REPOSITORY                           TAG       IMAGE ID       CREATED      SIZE
bigcodebench/bigcodebench-generate   latest    eec1e77e88eb   6 days ago   24.6GB
bigcodebench/bigcodebench-evaluate   latest    6ff203339e91   6 days ago   5.44GB
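
To make the setup fully pinnable, the image digests (standard docker inspect usage) and the transformers version shipped in the generation image (the python entrypoint is an assumption) can be recorded alongside the tags above:

# Record the exact image digests used for this run
docker inspect --format '{{index .RepoDigests 0}}' bigcodebench/bigcodebench-generate:latest
docker inspect --format '{{index .RepoDigests 0}}' bigcodebench/bigcodebench-evaluate:latest
# Print the transformers version inside the generate image (entrypoint assumed)
docker run --rm --entrypoint python bigcodebench/bigcodebench-generate:latest \
  -c "import transformers; print(transformers.__version__)"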
