[Bench] make OpenAI based bench functions more robust to different OpenAI impls #20070

Open
ilyal-cerebras wants to merge 1 commit into main from benchmark-openai-func

Conversation

Contributor

@ilyal-cerebras commented Jun 25, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Make the current benchmarking functions more robust to different OpenAI API implementations.

  1. During streamed chat completion, some OpenAI-compatible backends send an initial chunk that only marks the start of text generation (an empty delta carrying the role), for example:
chunk = {"id":"chatcmpl-72ddd64281f649b1ab1d32f5a92e42d3","object":"chat.completion.chunk","created":1750845197,"model":"XXX","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
chunk = {"id":"chatcmpl-72ddd64281f649b1ab1d32f5a92e42d3","object":"chat.completion.chunk","created":1750845197,"model":"XXX","choices":[{"index":0,"delta":{"content":"(","tool_calls":[]}}]}

Only the subsequent chunks contain generated text. The OpenAI API spec does not guarantee that this initial chunk is sent immediately before the first token, so there can be a noticeable delay between the two chunks. From a performance-benchmarking perspective, TTFT must therefore be measured from the first chunk that actually carries content (here, the second chunk).

  2. Similarly, choices and usage can arrive in the same chunk, so the if / elif is replaced with two independent if statements; otherwise output_tokens can be left unfilled. (A sketch illustrating both changes follows below.)
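
For both points, a minimal sketch of the intended parsing behaviour (not the actual vLLM benchmark code; the function and variable names here are illustrative, though they loosely follow the diff hunks quoted in the review below):

import time

def consume_stream(decoded_chunks, st):
    # Sketch only. decoded_chunks: parsed "chat.completion.chunk" dicts in
    # arrival order; st: time.perf_counter() taken when the request was sent.
    ttft = 0.0
    itl = []                                  # inter-token latencies
    generated_text = ""
    output_tokens = None
    most_recent_timestamp = st
    for data in decoded_chunks:
        timestamp = time.perf_counter()
        if choices := data.get("choices"):
            content = choices[0].get("delta", {}).get("content")
            if content:                       # skip the initial empty "role" chunk
                if ttft == 0.0:               # first content-bearing chunk -> TTFT
                    ttft = timestamp - st
                else:                         # decoding phase -> inter-token latency
                    itl.append(timestamp - most_recent_timestamp)
                generated_text += content
                most_recent_timestamp = timestamp
        if usage := data.get("usage"):        # plain `if`, not `elif`: usage may share a chunk with choices
            output_tokens = usage.get("completion_tokens")
    return ttft, itl, generated_text, output_tokens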

Test Plan

N/A

Test Result

N/A

(Optional) Documentation Update

N/A


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @ilyal-cerebras, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've made some adjustments to our OpenAI API benchmarking functions to enhance their robustness and accuracy when dealing with various OpenAI API implementations. The primary goal is to ensure our Time To First Token (TTFT) measurements are precise and that we reliably capture all relevant output metrics, even when API responses deviate slightly from typical patterns, such as sending initial empty chunks or combining content and usage data.

Highlights

  • TTFT Calculation Refinement: I've updated the async_request_openai_chat_completions function to calculate Time To First Token (TTFT) only when the first actual content-bearing token is received. This addresses scenarios where some OpenAI API implementations might send an initial 'empty' chunk, ensuring more accurate TTFT measurements by waiting for the first meaningful content.
  • Robust Output Parsing: In async_request_openai_chat_completions, I've changed the parsing logic from an elif to an if statement for handling usage information. This allows both choices (generated content) and usage (token count) data to be processed correctly if they appear within the same API response chunk, preventing output_tokens from being missed.
  • Timing Precision: I've adjusted the TTFT calculation in async_request_openai_completions to use the timestamp of the received chunk instead of time.perf_counter(). This provides a more precise and consistent measurement based on the actual chunk arrival time, aligning with the goal of robust benchmarking.

@mergify bot added the performance (Performance-related issues) label on Jun 25, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request significantly improves the robustness and accuracy of the OpenAI-based benchmarking functions. The changes correctly handle cases where initial API responses might not contain actual content and ensure that both choices and usage data are processed correctly when present in the same chunk. These are important fixes for reliable performance measurement.

@@ -316,7 +316,7 @@ async def async_request_openai_completions(
                         # First token
                         if not first_chunk_received:
                             first_chunk_received = True
-                            ttft = time.perf_counter() - st
+                            ttft = timestamp - st
Contributor

Severity: high

This change correctly uses the timestamp variable, which is updated per chunk, for calculating the Time To First Token (TTFT). This ensures that the TTFT is accurately measured based on the arrival of the first meaningful token, rather than just the first chunk, which might be empty. This aligns with the PR description's goal of making the benchmark more robust to different OpenAI API implementations.
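
For context, a hedged sketch of this pattern in isolation (simplified and synchronous, not the actual async_request_openai_completions body; raw_chunks and the field names are assumptions): the per-chunk timestamp is taken on arrival and reused, so decoding and parsing time is not counted into TTFT.

import json
import time

def first_token_latency(raw_chunks, st):
    # Sketch only (assumes the SSE stream is already split into chunk_bytes items):
    # record the arrival time of each chunk before decoding it, so TTFT reflects
    # when the chunk arrived rather than when parsing finished.
    for chunk_bytes in raw_chunks:
        timestamp = time.perf_counter()                     # arrival time of this chunk
        payload = chunk_bytes.decode("utf-8").removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        data = json.loads(payload)
        text = data["choices"][0].get("text")
        if text:                                            # first chunk carrying generated text
            return timestamp - st                           # not time.perf_counter() - st
    return 0.0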

Comment on lines 430 to 451
+                        if content:
                             # First token
                             if ttft == 0.0:
                                 ttft = timestamp - st
                                 output.ttft = ttft

                             # Decoding phase
                             else:
                                 output.itl.append(timestamp - most_recent_timestamp)

-                        generated_text += content or ""
-                    elif usage := data.get("usage"):
+                            generated_text += content
+                        if usage := data.get("usage"):
Contributor

Severity: high

Wrapping the TTFT and decoding phase logic within an if content: block is a crucial improvement. Some OpenAI API implementations might send an initial chunk with no actual content, which would incorrectly trigger TTFT calculation. This change ensures that TTFT is measured only when a chunk contains actual generated text, leading to more accurate benchmarking.

Comment on lines +440 to +452
generated_text += content
if usage := data.get("usage"):
Contributor

Severity: high

Changing elif usage := data.get("usage"): to if usage := data.get("usage"): is important for correctness. It allows the usage information to be processed even if the chunk also contains choices data. Some OpenAI API implementations might include both choices and usage in the same chunk, and using elif would prevent output_tokens from being correctly populated in such cases. This ensures that output_tokens is always filled when usage data is available.
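
As an illustration (a hypothetical chunk with made-up token counts, in the same shape as the examples from the PR description), a final chunk from such an implementation can carry both fields at once; with elif, the usage part would be skipped:

chunk = {"id":"chatcmpl-72ddd64281f649b1ab1d32f5a92e42d3","object":"chat.completion.chunk","created":1750845197,"model":"XXX","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":87,"total_tokens":99}}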

@ilyal-cerebras force-pushed the benchmark-openai-func branch 2 times, most recently from 0b21f4e to e872136 on June 25, 2025 11:24
…enAI impls

Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net>
@ilyal-cerebras force-pushed the benchmark-openai-func branch from e872136 to e5ef2a5 on June 25, 2025 14:44
@ilyal-cerebras
Contributor Author

ready
