### Description
Hi team! First things first, thank you for creating this wonderful benchmark!
I believe its curation and evaluation required a lot of effort, so I really appreciate that you open-sourced the datasets and evaluation scripts for the community.
### Summary of the issue
I have been trying to reproduce the leaderboard values by running the scripts locally, and the metrics I obtain locally are consistently worse than the reported values.
Although I understand that it is very hard to reproduce the exact leaderboard values, the difference is rather large: for `01-ai/Yi-1.5-9B-Chat`, the absolute difference in pass@1 is 6.3 points for the complete subset and 4.1 points for the instruct subset.
Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!
### Results
- Running the code evaluation script against pregenerated LLM outputs -> ✅ I can reproduce the leaderboard values (the command is sketched right after this list):

  | Subset | Leaderboard pass@1 | Local pass@1 |
  |---|---|---|
  | complete | 42.4 | 41.9 |
  | instruct | 34.5 | 34.4 |
- Running the generation and code evaluation scripts from scratch with the prebuilt Docker images on an A10 GPU -> ❌ I cannot reproduce the leaderboard values (full commands under "Steps to reproduce"):

  | Subset | Leaderboard pass@1 | Local pass@1 |
  |---|---|---|
  | complete | 42.4 | 36.1 (🔻6.3) |
  | instruct | 34.5 | 30.4 (🔻4.1) |
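
For reference, the pregenerated-outputs check in the first row used the same evaluate image. A rough sketch of the command, with the samples file left as a placeholder because it depends on the released artifact:

```bash
# Sketch of the pregenerated-outputs evaluation (complete subset shown).
# <PREGENERATED_SAMPLES>.jsonl is a placeholder for the released,
# already sanitized-and-calibrated samples file, not an actual file name.
docker run -m 16g -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --subset complete --samples <PREGENERATED_SAMPLES>.jsonl --parallel 32
```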
### Notes
- The number of timed-out problems is as follows and cannot account for the discrepancy (a rough way to count them is sketched after this list):

  | Subset | Leaderboard | Local evaluation |
  |---|---|---|
  | complete | 17 | 15 |
  | instruct | 17 | 15 |
- I increased the memory limit for the evaluation container (`-m 16g`) and did not see errors like `failed to map segment from shared object` during evaluation.
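
A quick way to count the timed-out problems from the evaluation output is sketched below; both the results file name and the `timeout` status string are assumptions, so they may need to be adjusted to the evaluator's actual output format:

```bash
# Rough proxy for the number of timed-out tasks: count occurrences of the
# string "timeout" in the evaluation results JSON. The file name and the
# status string are assumptions; adjust to the evaluator's actual output.
grep -o 'timeout' 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1-sanitized-calibrated_eval_results.json | wc -l
```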
### Steps to reproduce
I ran `01-ai/Yi-1.5-9B-Chat` on an A10 GPU to generate the LLM responses and then evaluated them, using the prebuilt Docker images for both steps.
The evaluation was run on 2024-07-08 at 14:00 for the complete subset and at 18:57 for the instruct subset.
The generation script (shown for the instruct subset):

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

# Greedy decoding (temperature 0, 1 sample, batch size 1) with the vLLM backend on a single GPU.
docker run --gpus '"device=0"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest --model "01-ai/Yi-1.5-9B-Chat" --subset "instruct" --greedy --bs "1" --temperature "0" --n_samples "1" --backend vllm --tp "1"
```
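
One quick sanity check on the generation step is to confirm that there is one generated sample per problem before evaluating (the JSONL file name matches the one passed to the sanitizer below):

```bash
# Each line of the generation output should correspond to one problem,
# so the line count should equal the number of problems in the subset.
wc -l 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1.jsonl
```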
The evaluation script (shown for the instruct subset):

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

# Sanitize and calibrate the raw generations.
docker run -it --entrypoint bigcodebench.sanitize -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1.jsonl --calibrate

# Evaluate the sanitized, calibrated samples with a 16 GB memory limit and 32 parallel workers.
docker run -m 16g -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --subset instruct --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1-sanitized-calibrated.jsonl --parallel 32
```
Docker images:

```
REPOSITORY                            TAG       IMAGE ID       CREATED      SIZE
bigcodebench/bigcodebench-generate    latest    eec1e77e88eb   6 days ago   24.6GB
bigcodebench/bigcodebench-evaluate    latest    6ff203339e91   6 days ago   5.44GB
```
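
If it helps with diagnosis, the exact builds can be identified by their digests using standard Docker commands (nothing BigCodeBench-specific):

```bash
# List the pinned digests of the two images used above.
docker images --digests bigcodebench/bigcodebench-generate
docker images --digests bigcodebench/bigcodebench-evaluate
```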