(Tokenization) Performance Degradation Starting from Transformers v4.42.* #21
Comments
Hi @takkyu2, sorry to hear this! I'll spend some time today and tomorrow looking into it. Meanwhile, would you mind providing the outputs you generated?
Thank you very much for your help! I attached the JSON files of the local generation/eval results and the leaderboard eval results below:
Hi @takkyu2, I re-evaluated the provided files here: yi_results.zip. I only got ~6 timeout tasks; most of your "timeout" tasks are mainly related to the sklearn dataset download and modeling, where I'd expect a longer time to pass the tests. The current v0.1.7.0 release only focuses on the evaluation speed, so I set the time limit to 120 seconds for both ground truths and generated outputs. I've extended the time limit to 240 seconds in the upcoming v0.1.8 release: #17. For the re-evaluated results, I got the same results as the reported ones based on the pre-generated outputs. Regarding your local outputs, I got the same scores as yours, though the number of timeout tasks is significantly reduced. One thing I find quite strange is the extra spaces after commas and full stops in the prompts of your local outputs; the original prompts do not have these spaces. Would you mind doing a new set of generations w/o the Docker image and seeing if the extra spaces still exist? I suspect this may be due to some incompatibility in your environment.
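As a quick sanity check (a sketch only; the model name and prompt below are placeholders, not the benchmark prompts), you can round-trip a prompt through the tokenizer in each environment and see whether the decoded text picks up extra spaces:

```python
# Sketch: check whether an encode/decode round-trip alters a prompt.
# Run the same snippet in both environments being compared.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
prompt = "Write a function that, given a list of numbers, returns their sum."

ids = tok(prompt, add_special_tokens=False).input_ids
decoded = tok.decode(ids)

print(repr(prompt))
print(repr(decoded))
# If decoded differs from prompt (e.g., extra spaces after ',' or '.'),
# the tokenizer round-trip, not the model, is altering the text.
```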
Hi @terryyz, thank you for the very detailed analysis; this helps a lot! Let me rerun the generation without the Docker environment. I hadn't noticed that space-after-comma issue, which is strange... I will check my environment (e.g., the library versions) to see whether the issue lies there.
Sorry @takkyu2, here's a correction based on my newly generated outputs (new_yi_results.zip):
The original generation was done on May 22nd, as documented on my server. The framework should be based on this version: https://github.com/bigcode-project/bigcodebench/tree/3cdf0ea6484c6c4fcb6ef26ed4bf3c7e7be1b552, from when the framework was still called WildCodeBench. There was not much difference between the two. I'm not sure if it's necessary to update the results on the leaderboard, given that vLLM keeps changing, and so do some model chat templates. For reference, I attached all the files of Yi 9B Chat here: original_yi_9b_results.zip. I checked other recently evaluated models; they don't have the space-after-comma and space-after-full-stop issues. I wonder if this issue is Yi-model-specific.
FYI, I'm running CodeQwen as an example to see if there is any degradation. Let me know if you want to check other models :)
Thank you @terryyz! Hmm, yeah, the root cause might be that some change at the vLLM layer affects the LLM outputs, causing the discrepancy. As for whether this issue is Yi-model-specific: as far as I have tried, the instruct task scores are worse than the leaderboard values for other models as well. Unlike the Yi-model scores, though, those scores were evaluated w/o the Docker env, so the difference there may be attributable to the environment. Instruct task scores evaluated w/o the Docker environment:
Thanks! @takkyu2
Hi @takkyu2, while I'm waiting for the other two models, here's the result for CodeQwen1.5-7B-Chat on the Complete split. The difference is not that big.
I also noticed there have been quite a few discussions in vLLM regarding the inconsistency of greedy decoding: vllm-project/vllm#5898. I generally use a batch size of 5 to speed up the process. I should pin a separate issue for this in our repo. However, I don't expect the inconsistency to result in such a large discrepancy. My current guess is that the difference we observed is likely due to updates in the vLLM version; note that there was a big change between recent vLLM releases. BTW, could you please check the pass rate of the ground truths in your local environment? That will tell you whether the large discrepancy is due to the local environment or just the generations. Ideally, the ground-truth pass rate is close to 100%; for example, I get 99.6% on my machine.
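For context, here is a minimal sketch of batched greedy decoding with vLLM (the model name, prompts, and `max_tokens` are placeholders, not the benchmark's actual generation code):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="01-ai/Yi-1.5-9B-Chat")
# temperature=0.0 requests greedy decoding; with batched execution, small
# numerical differences can still make outputs vary across runs/batch sizes.
params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = [f"Task {i} prompt ..." for i in range(5)]  # e.g., a batch of 5
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```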
Okay, I got the following results on my machine using vLLM v0.5.1.
The results are very close to yours, suggesting the decoding inconsistency is minimal.
Hi @takkyu2, I did more ablation studies. TL;DR: the main issue is the `transformers` version. I experimented with different `transformers` versions; specifically, I used Yi-9B-Chat as an example: 4402_yi.zip
The weird extra spaces disappeared in the attached outputs. I haven't seen anyone discussing similar issues before, and it should be a big issue IMO. However, due to the lack of detailed investigation, I don't know which part of the implementation resulted in such a degradation. Let me know if you'd like to investigate this; otherwise, we can simply file an issue in the `transformers` repo.
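In the meantime, a small local guard can flag the suspect range (the v4.42 threshold here is taken from this issue's title, not from a confirmed upstream changelog):

```python
import transformers
from packaging import version

# Warn if the installed transformers falls in the range suspected above.
if version.parse(transformers.__version__) >= version.parse("4.42.0"):
    print(
        f"transformers {transformers.__version__} detected; tokenization "
        "behavior may differ from earlier releases - consider pinning an "
        "older version until the upstream fix lands."
    )
```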
Hi @terryyz, thanks a lot for the quick turnaround and for spotting the root cause! I agree with you that filing this issue in the `transformers` repo makes sense. Thank you again for your tremendous help!
I answered on the thread but am available to fix this ASAP! Sounds bad 😢
Thanks @ArthurZucker! Hope it will be fixed soon. I expect this issue will greatly affect other benchmarks. It should be a big problem, but no one has concretely discussed this...
Hi @takkyu2! Just a note that v0.1.8 has been released with a temporary fix. More details about BigCodeBench-Hard can be found at https://huggingface.co/blog/terryyz/bigcodebench-hard.
Closed this issue for now :)
Thanks a lot @terryyz for addressing the issue, and congratulations on the BigCodeBench-Hard release 🎉! I will try v0.1.8 when I have enough bandwidth.
Hi team! First things first, thank you for creating this wonderful benchmark!
I believe its curation and evaluation required a lot of effort, so I really appreciate that you open-sourced the datasets and evaluation scripts for the community.
Summary of the issue
I have been trying to reproduce the leaderboard values by running the scripts locally, and I found that the metrics evaluated locally are consistently worse than the leaderboard values.
Although I understand that it is very hard to reproduce the exact values from the leaderboard, the difference is rather large: for `01-ai/Yi-1.5-9B-Chat`, the absolute difference in `pass@1` is 6.3 for the complete subset and 4.1 for the instruct subset, respectively. Please let me know if I have made any mistakes on my side or if I can provide further information for diagnosing the issue. Thank you!
Results
Notes
I got a `failed to map segment from shared object` error during evaluation.
Steps to reproduce
I ran `01-ai/Yi-1.5-9B-Chat` on an A10 GPU to generate the LLM responses and evaluated them, using Docker images for both steps. The evaluation was done on 2024-07-08 14:00 for the complete subset and 2024-07-08 18:57 for the instruct subset.
The generation script:
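(The command below is a sketch rather than my exact script; the flag names are assumptions based on the public BigCodeBench README of that period.)

```python
# Hypothetical wrapper around the BigCodeBench generation CLI.
import subprocess

subprocess.run(
    [
        "bigcodebench.generate",
        "--model", "01-ai/Yi-1.5-9B-Chat",
        "--subset", "complete",   # run again with "instruct"
        "--backend", "vllm",
        "--greedy",
    ],
    check=True,
)
```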
The evaluation script:
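(Again a sketch; the samples file name and flags are assumptions, not my exact invocation.)

```python
# Hypothetical wrapper around the BigCodeBench evaluation CLI; the samples
# file name is a placeholder for the generated/sanitized outputs.
import subprocess

subprocess.run(
    [
        "bigcodebench.evaluate",
        "--subset", "complete",
        "--samples", "samples-sanitized-calibrated.jsonl",
    ],
    check=True,
)
```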
Docker images: