[Bench] make OpenAI based bench functions more robust to different OpenAI impls #20070

Open
ilyal-cerebras wants to merge 1 commit into main from benchmark-openai-func

Conversation

Contributor

@ilyal-cerebras commented Jun 25, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Make the current benchmarking functions more robust to different OpenAI API implementations.

  1. During streamed chat completion, some OpenAI-compatible backends send an initial chunk that only marks the start of text generation (an empty delta carrying the role), for example:
chunk = {"id":"chatcmpl-72ddd64281f649b1ab1d32f5a92e42d3","object":"chat.completion.chunk","created":1750845197,"model":"XXX","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
chunk = {"id":"chatcmpl-72ddd64281f649b1ab1d32f5a92e42d3","object":"chat.completion.chunk","created":1750845197,"model":"XXX","choices":[{"index":0,"delta":{"content":"(","tool_calls":[]}}]}

Only the subsequent chunks contain generated text. The OpenAI API spec does not guarantee that this initial chunk is sent immediately before the first token, so there can be a noticeable delay between the two chunks. From a performance-benchmarking perspective, TTFT must therefore be measured from the first chunk that actually carries content (here, the second chunk).

  2. Similarly, choices and usage can arrive in the same chunk, so the if / elif is replaced with two independent if statements; otherwise output_tokens can be left unfilled. (A sketch illustrating both changes follows below.)
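
For both points, a minimal sketch of the intended parsing behaviour (not the actual vLLM benchmark code; the function and variable names here are illustrative, though they loosely follow the diff hunks quoted in the review below):

import time

def consume_stream(decoded_chunks, st):
    # Sketch only. decoded_chunks: parsed "chat.completion.chunk" dicts in
    # arrival order; st: time.perf_counter() taken when the request was sent.
    ttft = 0.0
    itl = []                                  # inter-token latencies
    generated_text = ""
    output_tokens = None
    most_recent_timestamp = st
    for data in decoded_chunks:
        timestamp = time.perf_counter()
        if choices := data.get("choices"):
            content = choices[0].get("delta", {}).get("content")
            if content:                       # skip the initial empty "role" chunk
                if ttft == 0.0:               # first content-bearing chunk -> TTFT
                    ttft = timestamp - st
                else:                         # decoding phase -> inter-token latency
                    itl.append(timestamp - most_recent_timestamp)
                generated_text += content
                most_recent_timestamp = timestamp
        if usage := data.get("usage"):        # plain `if`, not `elif`: usage may share a chunk with choices
            output_tokens = usage.get("completion_tokens")
    return ttft, itl, generated_text, output_tokens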

Test Plan

N/A

Test Result

N/A

(Optional) Documentation Update

N/A


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @ilyal-cerebras, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've made some adjustments to our OpenAI API benchmarking functions to enhance their robustness and accuracy when dealing with various OpenAI API implementations. The primary goal is to ensure our Time To First Token (TTFT) measurements are precise and that we reliably capture all relevant output metrics, even when API responses deviate slightly from typical patterns, such as sending initial empty chunks or combining content and usage data.

Highlights

  • TTFT Calculation Refinement: I've updated the async_request_openai_chat_completions function to calculate Time To First Token (TTFT) only when the first actual content-bearing token is received. This addresses scenarios where some OpenAI API implementations might send an initial 'empty' chunk, ensuring more accurate TTFT measurements by waiting for the first meaningful content.
  • Robust Output Parsing: In async_request_openai_chat_completions, I've changed the parsing logic from an elif to an if statement for handling usage information. This allows both choices (generated content) and usage (token count) data to be processed correctly if they appear within the same API response chunk, preventing output_tokens from being missed.
  • Timing Precision: I've adjusted the TTFT calculation in async_request_openai_completions to use the timestamp of the received chunk instead of time.perf_counter(). This provides a more precise and consistent measurement based on the actual chunk arrival time, aligning with the goal of robust benchmarking.

@mergify bot added the performance (Performance-related issues) label on Jun 25, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request significantly improves the robustness and accuracy of the OpenAI-based benchmarking functions. The changes correctly handle cases where initial API responses might not contain actual content and ensure that both choices and usage data are processed correctly when present in the same chunk. These are important fixes for reliable performance measurement.

@@ -316,7 +316,7 @@ async def async_request_openai_completions(
                         # First token
                         if not first_chunk_received:
                             first_chunk_received = True
-                            ttft = time.perf_counter() - st
+                            ttft = timestamp - st
Contributor

Severity: high

This change correctly uses the timestamp variable, which is updated per chunk, for calculating the Time To First Token (TTFT). This ensures that the TTFT is accurately measured based on the arrival of the first meaningful token, rather than just the first chunk, which might be empty. This aligns with the PR description's goal of making the benchmark more robust to different OpenAI API implementations.
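
For context, a hedged sketch of this pattern in isolation (simplified and synchronous, not the actual async_request_openai_completions body; raw_chunks and the field names are assumptions): the per-chunk timestamp is taken on arrival and reused, so decoding and parsing time is not counted into TTFT.

import json
import time

def first_token_latency(raw_chunks, st):
    # Sketch only (assumes the SSE stream is already split into chunk_bytes items):
    # record the arrival time of each chunk before decoding it, so TTFT reflects
    # when the chunk arrived rather than when parsing finished.
    for chunk_bytes in raw_chunks:
        timestamp = time.perf_counter()                     # arrival time of this chunk
        payload = chunk_bytes.decode("utf-8").removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        data = json.loads(payload)
        text = data["choices"][0].get("text")
        if text:                                            # first chunk carrying generated text
            return timestamp - st                           # not time.perf_counter() - st
    return 0.0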

Comment on lines 430 to 451
+                        if content:
                             # First token
                             if ttft == 0.0:
                                 ttft = timestamp - st
                                 output.ttft = ttft

                             # Decoding phase
                             else:
                                 output.itl.append(timestamp - most_recent_timestamp)

-                        generated_text += content or ""
-                    elif usage := data.get("usage"):
+                            generated_text += content
+                        if usage := data.get("usage"):
Contributor

Severity: high

Wrapping the TTFT and decoding phase logic within an if content: block is a crucial improvement. Some OpenAI API implementations might send an initial chunk with no actual content, which would incorrectly trigger TTFT calculation. This change ensures that TTFT is measured only when a chunk contains actual generated text, leading to more accurate benchmarking.

Comment on lines +440 to +452
generated_text += content
if usage := data.get("usage"):
Contributor

Severity: high

Changing elif usage := data.get("usage"): to if usage := data.get("usage"): is important for correctness. It allows the usage information to be processed even if the chunk also contains choices data. Some OpenAI API implementations might include both choices and usage in the same chunk, and using elif would prevent output_tokens from being correctly populated in such cases. This ensures that output_tokens is always filled when usage data is available.
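
As an illustration (a hypothetical chunk with made-up token counts, in the same shape as the examples from the PR description), a final chunk from such an implementation can carry both fields at once; with elif, the usage part would be skipped:

chunk = {"id":"chatcmpl-72ddd64281f649b1ab1d32f5a92e42d3","object":"chat.completion.chunk","created":1750845197,"model":"XXX","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":87,"total_tokens":99}}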

@ilyal-cerebras force-pushed the benchmark-openai-func branch 2 times, most recently from 0b21f4e to e872136 on June 25, 2025 11:24
…enAI impls

Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net>
@ilyal-cerebras force-pushed the benchmark-openai-func branch from e872136 to e5ef2a5 on June 25, 2025 14:44
@ilyal-cerebras
Contributor Author

ready
