Support OpenAI API server in benchmark_serving.py
#2172
Conversation
can you enable maintainers to edit this PR? I have some small fixes here
Thanks for the review @simon-mo, unfortunately I don't think I can because my PR is from an organisation-owned fork, not a user-owned fork. I'd be happy to make any changes you suggest, though!
I have just realised that using
Thanks for making the change. Let me add the benchmark script because it requires some back and forth.
Hmm, I see. It is pretty tricky because a one-liner won't do. Feel free to open another PR!
Hi guys, thanks for the PR. I just saw that in the benchmark (https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py#L138) it doesn't count the token length of the output response; it takes max_tokens directly as output_len. I think we should measure the actual output token throughput, not the requested max tokens.
Hi @tattrongvu, this PR is only really concerned with enabling the benchmarking of the OpenAI-compatible server, not with any specifics of the benchmark itself.
This is partially correct. You're right that the actual output token length is not measured, but it's not using max_tokens either: output_len is taken from the length of the reference completion in the dataset (see benchmarks/benchmark_serving.py, lines 50 to 57 at 3209b49).
This method is not great because it's very unlikely that any model will actually recreate the exact completion in the dataset. The good news is that this is resolved in #2433!
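For context, the referenced sampling step does roughly the following: the per-request output_len is derived from the dataset's reference completion rather than from the model's actual response. A minimal paraphrased sketch (the tokenizer and dataset below are illustrative, not the script's actual values):

```python
from transformers import AutoTokenizer

# Paraphrased sketch of the sampling logic under discussion (not the exact
# lines 50-57 of the script): output_len is taken from the dataset's
# reference completion, not from what the model actually returns.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer only

dataset = [  # (prompt, reference completion) pairs, illustrative
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Name three primary colours.", "Red, yellow and blue."),
]

tokenized_dataset = []
for prompt, completion in dataset:
    prompt_len = len(tokenizer(prompt).input_ids)
    output_len = len(tokenizer(completion).input_ids)  # assumed, not measured
    tokenized_dataset.append((prompt, prompt_len, output_len))

print(tokenized_dataset)
```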
Hello @tattrongvu! As @hmellor mentioned, the original benchmark script always assumes the model generates the same number of tokens as the reference completion in the dataset. One of the fixes I did in #2433 is to postprocess the generated text and measure the number of tokens, to give a more accurate result.
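A minimal sketch of that kind of postprocessing, assuming a Hugging Face tokenizer is available (the model name and helper below are illustrative, not the actual code from #2433):

```python
from transformers import AutoTokenizer

# Illustrative sketch: measure the tokens the server actually generated
# instead of trusting max_tokens or the dataset completion length.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer only

def measure_output_len(generated_text: str) -> int:
    """Count the tokens in the generated text, excluding special tokens."""
    return len(tokenizer(generated_text, add_special_tokens=False).input_ids)

# Example: compute true output-token throughput for a hypothetical run.
generated_texts = ["Paris is the capital of France.", "def add(a, b):\n    return a + b"]
benchmark_duration_s = 2.0
total_output_tokens = sum(measure_output_len(t) for t in generated_texts)
print(f"Output token throughput: {total_output_tokens / benchmark_duration_s:.2f} tok/s")
```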
Adds `--endpoint` and `--model` parameters so that the user can benchmark their OpenAI-compatible vLLM server.
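Roughly, the new flags amount to argparse additions along these lines (a sketch; the defaults and help text are assumptions, not copied from the PR):

```python
import argparse

# Sketch of the added CLI flags; defaults and help strings are illustrative.
parser = argparse.ArgumentParser(description="Benchmark online serving throughput.")
parser.add_argument("--endpoint", type=str, default="/v1/completions",
                    help="API path on the OpenAI-compatible server to send requests to.")
parser.add_argument("--model", type=str, required=True,
                    help="Model name to include in the request payload.")
args = parser.parse_args()
print(args.endpoint, args.model)
```

With flags like these, the script can be pointed at a locally running vLLM server, e.g. `python benchmarks/benchmark_serving.py --model <model-name> --endpoint /v1/completions` plus whatever other arguments the script already requires.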