
(Tokenization) Performance Degradation Starting from Transformers v4.42.* #21

Closed
takkyu2 opened this issue Jul 9, 2024 · 17 comments

@takkyu2

takkyu2 commented Jul 9, 2024

Hi team! First things first, thank you for creating this wonderful benchmark!
I believe its curation and evaluation required a lot of effort, so I really appreciate that you open-sourced the datasets and evaluation scripts for the community.

Summary of the issue

I have been trying to reproduce the leaderboard values by running the scripts locally, and I found that the metrics evaluated locally are consistently worse than the leaderboard values.

Although I understand that it is very hard to reproduce the exact leaderboard values, the difference is rather large: for 01-ai/Yi-1.5-9B-Chat, the absolute difference in pass@1 is 6.3 for the complete subset and 4.1 for the instruct subset.

Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!

Results

  • Run the code evaluation script on the pre-generated LLM outputs -> ✅ I can roughly reproduce the leaderboard values:

    subset     Leaderboard   local evaluation
    complete   42.4          41.9
    instruct   34.5          34.4

  • Run the generation and code evaluation scripts from scratch with the prebuilt docker images on an A10 GPU -> ❌ I cannot reproduce the leaderboard values:

    subset     Leaderboard   local evaluation
    complete   42.4          36.1 (🔻6.3)
    instruct   34.5          30.4 (🔻4.1)

Notes

  • The number of problems that timed out is as follows, and it cannot account for the discrepancy:

    subset     Leaderboard   local evaluation
    complete   17            15
    instruct   17            15

  • I increased the memory limit and did not get errors like "failed to map segment from shared object" during evaluation.

Steps to reproduce

I ran 01-ai/Yi-1.5-9B-Chat on an A10 GPU to generate the LLM responses and then evaluated them, using the docker images for both steps.
The evaluation was done at 2024-07-08 14:00 for the complete subset and at 2024-07-08 18:57 for the instruct subset.

The generation script:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
# Greedy decoding (temperature 0, batch size 1, one sample per task) with the vLLM backend on a single GPU.
docker run --gpus '"device=0"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest --model "01-ai/Yi-1.5-9B-Chat" --subset "instruct" --greedy --bs "1" --temperature "0" --n_samples "1" --backend vllm --tp "1"

The evaluation script:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
# Sanitize and calibrate the raw samples, then evaluate them with a 16 GB memory limit and 32 parallel workers.
docker run -it --entrypoint bigcodebench.sanitize -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1.jsonl --calibrate
docker run -m 16g -v $(pwd):/app:rw bigcodebench/bigcodebench-evaluate:latest --subset instruct --samples 01-ai--Yi-1.5-9B-Chat--bigcodebench-instruct--vllm-0-1-sanitized-calibrated.jsonl --parallel 32

Docker images:

REPOSITORY                           TAG       IMAGE ID       CREATED      SIZE
bigcodebench/bigcodebench-generate   latest    eec1e77e88eb   6 days ago   24.6GB
bigcodebench/bigcodebench-evaluate   latest    6ff203339e91   6 days ago   5.44GB
@terryyz terryyz self-assigned this Jul 9, 2024
@terryyz
Collaborator

terryyz commented Jul 9, 2024

Hi @takkyu2, sorry to hear this!

I'll spend some time today and tomorrow looking into it. Meanwhile, would you mind providing the outputs you generated?

@takkyu2
Author

takkyu2 commented Jul 9, 2024

Thank you very much for your help! I have attached the JSON files of the local generation/eval results and the leaderboard eval results below:
results.zip

@terryyz
Collaborator

terryyz commented Jul 9, 2024

Hi @takkyu2, I re-evaluated the provided files here: yi_results.zip

I only got ~6 timeout tasks. Most of your "timeout" tasks are related to the sklearn dataset download and modeling, so they need more time to pass the tests. The current v0.1.7.0 release focuses on evaluation speed, so I set the time limit to 120 seconds for both ground truths and generated outputs. I've extended the time limit to 240 seconds in the upcoming v0.1.8 release: #17.

Re-evaluating the pre-generated outputs, I got the same results as the reported ones. Regarding your local outputs, I got the same scores as yours, though the number of timeout tasks is significantly reduced.

One thing I find quite strange is the extra spaces after commas and full stops in the prompts of your local outputs. For example, you got random. shuffle in the docstring, but the actual one is random.shuffle. I assume that this difference may explain the discrepancy. I didn't get the extra spaces during my generation. I'm now doing a new generation, and it should be finished shortly; so far I have found no such issues in the newly generated outputs. I'm using vLLM v0.5.1, and the one in the docker image should be v0.5.0.

Would you mind doing a new set of generations without the docker image and seeing if the extra spaces still exist? I wonder if this is due to some incompatibility in your environment. The original prompts do not have these spaces.
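
If it helps, a rough scan along the lines of the sketch below could flag where the extra spaces show up. This is not part of the official tooling, and the "solution"/"task_id" field names are assumptions about the generated .jsonl layout, so adjust them to match your files:

import json
import re
import sys

# Flags "identifier. identifier(" patterns such as "random. shuffle(",
# i.e. a spurious space between a dot and the following attribute call.
PATTERN = re.compile(r"\b\w+\.\s+\w+\(")

path = sys.argv[1]  # e.g. the raw or sanitized-calibrated .jsonl file
hits = 0
with open(path) as f:
    for line in f:
        sample = json.loads(line)
        found = PATTERN.findall(sample.get("solution", ""))
        if found:
            hits += 1
            print(sample.get("task_id"), found[:3])
print(f"{hits} samples contain a suspicious space after a dot")

Running it on both a docker-generated file and a non-docker run should show whether the artifact is environment-specific.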

@takkyu2
Author

takkyu2 commented Jul 9, 2024

Hi @terryyz, thank you for the very detailed analysis, it helps a lot! Let me rerun the generation without the docker environment.

I hadn't noticed the space-after-comma issue; this is strange... I will check my environment (e.g., the library versions) to see if something is off there.

@terryyz
Collaborator

terryyz commented Jul 9, 2024

Sorry @takkyu2, here's a correction based on my newly generated outputs (new_yi_results.zip):

  • After detailed comparisons, I did find that some extra spaces still exist in the generated docstrings.
  • The newly generated outputs on BigCodeBench-Complete got 36.0, which is pretty similar to what you got.

The original generation was done on May 22nd, as documented on my server. The framework should be based on this version: https://github.com/bigcode-project/bigcodebench/tree/3cdf0ea6484c6c4fcb6ef26ed4bf3c7e7be1b552, when the framework was still called WildCodeBench. There was not much difference between generate.py and model.py except for the module name changes, and I barely touched bigcodebench.generate once it became stable. My explanation is that there have been some updates to vLLM that cause such a large discrepancy.

I'm not sure if it's necessary to update the results of the leaderboard, given that vLLM keeps changing, and so do some model chat templates.

For reference, I attached all the files of Yi 9B Chat here: original_yi_9b_results.zip

I checked other recently evaluated models. They don't have the issues of space-after-comma and space-after-full-stop. I wonder if this issue is Yi-model-specific.

@terryyz
Collaborator

terryyz commented Jul 9, 2024

FYI, I'm running CodeQwen as an example to see if there is any degradation. Let me know if you want to check other models :)

@takkyu2
Author

takkyu2 commented Jul 9, 2024

Thank you @terryyz! Hmm, yeah, the root cause might be that some change at the vLLM layer affects the LLM outputs, causing the discrepancy.

Regarding whether this issue is Yi-model-specific: as far as I have tried, instruct task scores are worse than the leaderboard values for other models as well. Unlike the Yi-model scores, though, those scores were evaluated without the docker environment, so the difference may be attributable to the environment difference.

Instruct task scores evaluated without the docker environment:

model                                 Leaderboard   local evaluation
google/codegemma-7b-it                32.3          27.5 (🔻4.8)
meta-llama/Meta-Llama-3-8B-Instruct   31.9          29.1 (🔻2.8)

@terryyz
Collaborator

terryyz commented Jul 9, 2024

Thanks! @takkyu2
I'll do the evals on these two models and see what I can get.

@terryyz
Collaborator

terryyz commented Jul 9, 2024

Hi @takkyu2,

While I'm waiting for the other two models, here's the result for CodeQwen1.5-7B-Chat on the Complete split. The difference is not that big.

model                      Leaderboard   local evaluation
Qwen/CodeQwen1.5-7B-Chat   43.6          44.7 (🔺1.1)

I also noticed there have been quite a few discussions in vLLM regarding the inconsistency of greedy decoding: vllm-project/vllm#5898. I generally use a batch size of 5 to speed up the process; I should pin a separate issue for this in our repo. However, I don't expect the inconsistency to result in a large discrepancy. My current guess is that our observed difference is likely due to the updates in the vLLM version. Also, a note: there was a big change in vLLM from v0.4.3 to v0.5.0 on June 12th: https://github.com/vllm-project/vllm/releases/tag/v0.5.0.
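
For anyone who wants to probe that greedy-decoding inconsistency locally, a minimal sketch along these lines could compare single-request and batched generations. It is not part of the benchmark tooling, and the prompts and max_tokens below are just placeholders:

from vllm import LLM, SamplingParams

# Placeholder prompts; in practice you would feed the BigCodeBench prompts here.
prompts = [
    "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n",
    "def shuffle_join(items):\n    \"\"\"Shuffle items with random.shuffle and join them.\"\"\"\n",
]

llm = LLM(model="01-ai/Yi-1.5-9B-Chat", tensor_parallel_size=1)
greedy = SamplingParams(temperature=0.0, max_tokens=256)

# Batch size 1: send one prompt per request.
single = [llm.generate([p], greedy)[0].outputs[0].text for p in prompts]
# One batched request containing all prompts.
batched = [out.outputs[0].text for out in llm.generate(prompts, greedy)]

for prompt, s, b in zip(prompts, single, batched):
    print(prompt.splitlines()[0], "-> identical:", s == b)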

BTW, could you please check the pass rate of the ground truths in your local environment? That will tell us whether the large discrepancy is due to the local environment or just the generations. Ideally, the ground-truth pass rate is close to 100%; I get 99.6% on my machine, for example.

@terryyz
Collaborator

terryyz commented Jul 9, 2024

Okay, I got the following results on my machine using vLLM v0.5.1.

model                                 Leaderboard   local evaluation
google/codegemma-7b-it                32.3          28.3 (🔻4.0)
meta-llama/Meta-Llama-3-8B-Instruct   31.9          28.8 (🔻3.1)

The results are very close to yours, suggesting that the decoding inconsistency is minimal. The main reason for the degradation should be the changes from v0.4.* to v0.5.*.

@terryyz
Collaborator

terryyz commented Jul 9, 2024

Hi @takkyu2, I did more ablation studies.

TL;DR: The main issue is the transformers version, while vLLM still has some inconsistency.

I experimented with different vLLM versions, and the results didn't change much; the same goes for flash-attn and triton. However, I observed a large difference when downgrading transformers to v4.40.2. I remember that I was using v4.40.* to evaluate the models reported in the arXiv paper.

Specifically, I used Yi-9B-Chat as an example: 4402_yi.zip

subset     Leaderboard   local evaluation
complete   42.4          41.8 (🔻0.6)
instruct   34.5          33.4 (🔻1.1)

The weird extra spaces disappeared in the attached outputs. I haven't noticed anyone discussing similar issues before, and it should be a big issue IMO. However, without a more detailed investigation, I don't know which part of the implementation resulted in such a degradation. Let me know if you'd like to investigate this; otherwise, we can simply file an issue in the transformers repo.
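
If anyone wants to dig into this before filing upstream, a sketch like the one below could serve as a starting point for a minimal reproduction: run it once under transformers v4.40.2 and once under v4.42.* and compare the output. Note that the tokenize/decode round-trip is only my guess at where the extra spaces could come from, and the sample text is made up:

import transformers
from transformers import AutoTokenizer

MODEL = "01-ai/Yi-1.5-9B-Chat"
TEXT = "Shuffle the list with random.shuffle, then join the items. Return the result."

# Encode the text and decode it again; a lossless round-trip means the
# tokenizer is not inserting extra spaces at this stage.
tok = AutoTokenizer.from_pretrained(MODEL)
ids = tok(TEXT, add_special_tokens=False)["input_ids"]
roundtrip = tok.decode(ids, skip_special_tokens=True)

print("transformers:", transformers.__version__)
print("original :", TEXT)
print("roundtrip:", roundtrip)
print("lossless :", roundtrip == TEXT)

If the round-trip turns out to be lossless under both versions, the extra spaces would more likely come from the chat-template or detokenization path used during generation instead.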

@takkyu2
Author

takkyu2 commented Jul 10, 2024

Hi @terryyz, thanks a lot for the quick turnaround and spotting the root cause!

I agree that filing this issue with the transformers folks is a good idea. This sounds like an unexpected change on the transformers side, and they should know better than we do what changed between v4.40.* and later versions.

Thank you again for your tremendous help!

@ArthurZucker

I answered on the thread but am available to fix this ASAP! Sounds bad 😢

@terryyz
Collaborator

terryyz commented Jul 10, 2024

Thanks @ArthurZucker! I hope it will be fixed soon. I expect this issue will greatly affect other benchmarks as well. It should be a big problem, but no one has concretely discussed it...

@terryyz terryyz changed the title Unable to reproduce leaderboard pass@1 by local generation/evaluation with prebuilt docker image Performance Degradation Starting from Transformers v4.42.* Jul 11, 2024
@terryyz terryyz pinned this issue Jul 11, 2024
@terryyz terryyz changed the title Performance Degradation Starting from Transformers v4.42.* (Tokenization) Performance Degradation Starting from Transformers v4.42.* Jul 17, 2024
@terryyz
Collaborator

terryyz commented Jul 17, 2024

Hi @takkyu2! Just a note that v0.1.8 has been released with a temporary fix. More details about BigCodeBench-Hard can be found at https://huggingface.co/blog/terryyz/bigcodebench-hard.

@terryyz
Collaborator

terryyz commented Jul 17, 2024

Closed this issue for now :)

@terryyz terryyz closed this as completed Jul 17, 2024
@takkyu2
Author

takkyu2 commented Jul 18, 2024

Thanks a lot @terryyz for addressing the issue, and congratulations on the BigCodeBench-Hard release 🎉! I will try v0.1.8 when I have enough bandwidth.

@terryyz terryyz unpinned this issue Sep 10, 2024