[Bench] make OpenAI based bench functions more robust to different OpenAI impls #20070
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Summary of Changes
Hello @ilyal-cerebras, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've made some adjustments to our OpenAI API benchmarking functions to enhance their robustness and accuracy when dealing with various OpenAI API implementations. The primary goal is to ensure our Time To First Token (TTFT) measurements are precise and that we reliably capture all relevant output metrics, even when API responses deviate slightly from typical patterns, such as sending initial empty chunks or combining content and usage data.
Highlights
- TTFT Calculation Refinement: I've updated the `async_request_openai_chat_completions` function to calculate Time To First Token (TTFT) only when the first actual content-bearing token is received. This addresses scenarios where some OpenAI API implementations might send an initial 'empty' chunk, ensuring more accurate TTFT measurements by waiting for the first meaningful content.
- Robust Output Parsing: In `async_request_openai_chat_completions`, I've changed the parsing logic from an `elif` to an `if` statement for handling `usage` information. This allows both `choices` (generated content) and `usage` (token count) data to be processed correctly if they appear within the same API response chunk, preventing `output_tokens` from being missed.
- Timing Precision: I've adjusted the TTFT calculation in `async_request_openai_completions` to use the `timestamp` of the received chunk instead of `time.perf_counter()`. This provides a more precise and consistent measurement based on the actual chunk arrival time, aligning with the goal of robust benchmarking. (A simplified sketch combining these adjustments follows this list.)
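Taken together, the highlights above amount to a parsing loop roughly like the sketch below. This is a simplified, hedged illustration, not the actual vLLM benchmark code: the function name, argument shape, and the omission of the surrounding HTTP/SSE handling are my own simplifications.

```python
import json
import time

def parse_chat_stream(chunk_payloads, st):
    """Simplified sketch of the hardened parsing loop (not the real benchmark code)."""
    ttft = 0.0
    generated_text = ""
    itl = []                 # inter-token latencies
    output_tokens = None
    most_recent_timestamp = st

    for payload in chunk_payloads:        # JSON strings from "data: ..." SSE lines
        timestamp = time.perf_counter()   # arrival time of this chunk
        data = json.loads(payload)

        if choices := data.get("choices"):
            content = choices[0]["delta"].get("content")
            # Only a content-bearing chunk counts for TTFT / ITL.
            if content:
                if ttft == 0.0:
                    ttft = timestamp - st                        # first token
                else:
                    itl.append(timestamp - most_recent_timestamp)
                generated_text += content
                most_recent_timestamp = timestamp

        # A plain `if`, not `elif`: the same chunk may carry both
        # `choices` and `usage`.
        if usage := data.get("usage"):
            output_tokens = usage.get("completion_tokens")

    return ttft, generated_text, itl, output_tokens
```

Feeding this a stream whose first chunk has an empty delta yields a TTFT measured from the first chunk that actually carries text, and `output_tokens` is filled even when `usage` arrives alongside `choices`.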
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request significantly improves the robustness and accuracy of the OpenAI-based benchmarking functions. The changes correctly handle cases where initial API responses might not contain actual content and ensure that both `choices` and `usage` data are processed correctly when present in the same chunk. These are important fixes for reliable performance measurement.
benchmarks/backend_request_func.py
Outdated
```diff
@@ -316,7 +316,7 @@ async def async_request_openai_completions(
                             # First token
                             if not first_chunk_received:
                                 first_chunk_received = True
-                                ttft = time.perf_counter() - st
+                                ttft = timestamp - st
```
This change correctly uses the `timestamp` variable, which is updated per chunk, for calculating the Time To First Token (TTFT). This ensures that the TTFT is accurately measured based on the arrival of the first meaningful token, rather than just the first chunk, which might be empty. This aligns with the PR description's goal of making the benchmark more robust to different OpenAI API implementations.
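To illustrate the idea, here is a small self-contained sketch. It is a hypothetical helper, not the benchmark code: the point is simply that one clock reading is taken per received chunk, and both TTFT and the inter-token latencies are derived from that same reading rather than from a fresh `time.perf_counter()` call made later in the loop body.

```python
import time

def time_chunks(chunks):
    """Hypothetical helper: TTFT and ITL both derived from per-chunk timestamps."""
    st = time.perf_counter()
    ttft = 0.0
    itl = []
    most_recent_timestamp = st
    first_chunk_received = False
    for chunk in chunks:
        timestamp = time.perf_counter()   # one reading per received chunk
        _ = len(chunk)                    # stand-in for per-chunk decoding work
        if not first_chunk_received:
            first_chunk_received = True
            ttft = timestamp - st         # same reading, not a fresh perf_counter()
        else:
            itl.append(timestamp - most_recent_timestamp)
        most_recent_timestamp = timestamp
    return ttft, itl

print(time_chunks(["chunk-1", "chunk-2", "chunk-3"]))
```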
benchmarks/backend_request_func.py
Outdated
```diff
+                    if content:
                         # First token
                         if ttft == 0.0:
                             ttft = timestamp - st
                             output.ttft = ttft

                         # Decoding phase
                         else:
                             output.itl.append(timestamp - most_recent_timestamp)

-                    generated_text += content or ""
-                elif usage := data.get("usage"):
+                        generated_text += content
```
Wrapping the TTFT and decoding phase logic within an `if content:` block is a crucial improvement. Some OpenAI API implementations might send an initial chunk with no actual content, which would incorrectly trigger TTFT calculation. This change ensures that TTFT is measured only when a chunk contains actual generated text, leading to more accurate benchmarking.
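As a concrete illustration, the chunk payloads below are made up (not taken from a real trace) and show how the guard skips an implementation that opens the stream with a role-only delta:

```python
import json

# Hypothetical stream opener with no generated text, followed by the first
# chunk that actually carries content.
chunks = [
    '{"choices": [{"delta": {"role": "assistant"}}]}',
    '{"choices": [{"delta": {"content": "Hello"}}]}',
]

for i, payload in enumerate(chunks):
    content = json.loads(payload)["choices"][0]["delta"].get("content")
    if content:                  # the guard: the empty opener does not count
        print(f"first token arrived in chunk {i}: {content!r}")
        break
```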
```diff
                         generated_text += content
+                if usage := data.get("usage"):
```
Changing `elif usage := data.get("usage"):` to `if usage := data.get("usage"):` is important for correctness. It allows the `usage` information to be processed even if the chunk also contains `choices` data. Some OpenAI API implementations might include both `choices` and `usage` in the same chunk, and using `elif` would prevent `output_tokens` from being correctly populated in such cases. This ensures that `output_tokens` is always filled when `usage` data is available.
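A hypothetical final chunk (made-up payload, not from a real trace) shows why the independent `if` matters: with `elif`, the `usage` branch would never run whenever `choices` is present in the same chunk.

```python
import json

# Made-up final chunk carrying both the last delta and the usage block.
final_chunk = json.loads(
    '{"choices": [{"delta": {"content": "!"}}],'
    ' "usage": {"completion_tokens": 42}}'
)

output_tokens = None
if choices := final_chunk.get("choices"):
    print("content:", choices[0]["delta"].get("content"))
# An `elif` here would be skipped for this chunk; a separate `if` is not.
if usage := final_chunk.get("usage"):
    output_tokens = usage.get("completion_tokens")

print("output_tokens:", output_tokens)   # 42 rather than None
```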
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as `supported_models.md` and `examples` for a new model.

Purpose
Make current benchmarking functions more robust to different OpenAI API implementations.
Some OpenAI API implementations send an initial chunk that carries no actual content, and only the next chunks contain generated data. The OpenAI API spec does not guarantee that the first chunk is sent right before the first token, so we can expect a delay between those two chunks. From a perf-benchmarking perspective, we therefore need to assume that the first token is contained in the second chunk (see the illustrative numbers below).
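For example, with purely illustrative, made-up timings:

```python
# Made-up timeline, in seconds after the request is sent:
#   0.05 s: first chunk arrives, but its delta carries no content
#   0.80 s: second chunk arrives with the first generated token
st, empty_chunk_ts, first_token_ts = 0.0, 0.05, 0.80

naive_ttft = empty_chunk_ts - st    # 0.05 s, measured at the empty chunk
robust_ttft = first_token_ts - st   # 0.80 s, measured at the first real token
print(naive_ttft, robust_ttft)
```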
`choices` and `usage` can be within the same chunk, so replace `if / else` with a double `if`; otherwise `output_tokens` can be left unfilled.

Test Plan
N/A
Test Result
N/A
(Optional) Documentation Update
N/A