[Bench] make OpenAI based bench functions more robust to different OpenAI impls #20070

Open · wants to merge 1 commit into base: main

57 changes: 39 additions & 18 deletions benchmarks/backend_request_func.py
@@ -296,6 +296,7 @@ async def async_request_openai_completions(
             ) as response:
                 if response.status == 200:
                     first_chunk_received = False
+                    first_token_received = False
                     async for chunk_bytes in response.content:
                         chunk_bytes = chunk_bytes.strip()
                         if not chunk_bytes:
@@ -309,24 +310,31 @@
                             # usage summary response without a token so we
                             # want to check a token was generated
                             if choices := data.get("choices"):
+                                first_chunk_received = True
                                 # Note that text could be empty here
                                 # e.g. for special tokens
                                 text = choices[0].get("text")
+                                chunk_has_valid_text = text and len(text) > 0
                                 timestamp = time.perf_counter()
                                 # First token
-                                if not first_chunk_received:
-                                    first_chunk_received = True
-                                    ttft = time.perf_counter() - st
-                                    output.ttft = ttft
+                                if chunk_has_valid_text:
+                                    if not first_token_received:
+                                        first_token_received = True
+                                        ttft = timestamp - st
+                                        output.ttft = ttft
 
-                                # Decoding phase
-                                else:
-                                    output.itl.append(timestamp - most_recent_timestamp)
+                                    # Decoding phase
+                                    else:
+                                        output.itl.append(
+                                            timestamp - most_recent_timestamp
+                                        )
 
-                                most_recent_timestamp = timestamp
-                                generated_text += text or ""
+                                    generated_text += text
                             if usage := data.get("usage"):
                                 output.output_tokens = usage.get("completion_tokens")
 
+                            most_recent_timestamp = timestamp
+
                     if first_chunk_received:
                         output.success = True
                     else:
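
To make the intent of the completions-side change concrete, here is a minimal, self-contained sketch; it is not code from this PR, and the helper name and the (timestamp, text) chunk tuples are made up for illustration. It shows the bookkeeping the new logic performs: TTFT is taken from the first chunk whose text is non-empty, ITL from later non-empty chunks, and empty or usage-only chunks no longer count as the first token.

# Hypothetical illustration only: summarize_stream and its (timestamp, text)
# chunk tuples are stand-ins for the benchmark's streaming loop.
def summarize_stream(start_time, chunks):
    ttft = None
    itl = []
    first_token_received = False
    most_recent_timestamp = start_time
    generated_text = ""

    for timestamp, text in chunks:
        chunk_has_valid_text = text and len(text) > 0
        if chunk_has_valid_text:
            if not first_token_received:
                # The first chunk carrying real text defines TTFT.
                first_token_received = True
                ttft = timestamp - start_time
            else:
                # Later text-bearing chunks contribute inter-token latencies.
                itl.append(timestamp - most_recent_timestamp)
            generated_text += text
        # In this sketch the reference timestamp advances on every chunk;
        # the PR's exact placement of this update may differ.
        most_recent_timestamp = timestamp

    return ttft, itl, generated_text


# An empty first chunk (e.g. a special token or usage-only payload)
# no longer produces a misleadingly small TTFT:
print(summarize_stream(0.0, [(0.01, ""), (0.25, "Hello"), (0.31, " world")]))
# ttft = 0.25, itl ≈ [0.06], generated_text = "Hello world"
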
@@ -416,19 +424,32 @@ async def async_request_openai_chat_completions(
                             timestamp = time.perf_counter()
                             data = json.loads(chunk)
 
+                            # NOTE: Some completion API might have a last
+                            # usage summary response without a token so we
+                            # want to check a token was generated
                             if choices := data.get("choices"):
                                 content = choices[0]["delta"].get("content")
-                                # First token
-                                if ttft == 0.0:
-                                    ttft = timestamp - st
-                                    output.ttft = ttft
-
-                                # Decoding phase
-                                else:
-                                    output.itl.append(timestamp - most_recent_timestamp)
+                                # NOTE: Some completion APIs might send the first chunk
+                                # right after they receive a request, not just before
+                                # the first generated token. This significantly affects
+                                # TTFT. The first chunk in v1/chat/completions carries no
+                                # actual content, so we can rely on that.
+                                chunk_has_valid_content = content and len(content) > 0
+                                if chunk_has_valid_content:
+                                    # First token
+                                    if ttft == 0.0:
+                                        ttft = timestamp - st
+                                        output.ttft = ttft
 
-                                generated_text += content or ""
-                            elif usage := data.get("usage"):
+                                    # Decoding phase
+                                    else:
+                                        output.itl.append(
+                                            timestamp - most_recent_timestamp
+                                        )
+
+                                    generated_text += content
+                            if usage := data.get("usage"):
Comment on lines +451 to +452 (Contributor, severity: high):
Changing elif usage := data.get("usage"): to if usage := data.get("usage"): is important for correctness. It allows the usage information to be processed even if the chunk also contains choices data. Some OpenAI API implementations might include both choices and usage in the same chunk, and using elif would prevent output_tokens from being correctly populated in such cases. This ensures that output_tokens is always filled when usage data is available.

                                 output.output_tokens = usage.get("completion_tokens")
 
                             most_recent_timestamp = timestamp
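
The reviewer's point can be illustrated with a short, self-contained sketch; the chunk payload below is hypothetical and not taken from the PR. When a single streamed chunk carries both choices and usage, an elif skips the usage summary, while two independent if checks record both.

import json

# Hypothetical final SSE chunk from an OpenAI-compatible server that bundles
# the last content delta together with the usage summary in one payload.
chunk = json.loads(
    '{"choices": [{"delta": {"content": "!"}}],'
    ' "usage": {"completion_tokens": 42}}'
)

tokens_with_elif = None
tokens_with_if = None

# elif variant: the usage branch is never reached because choices matched.
if chunk.get("choices"):
    pass  # delta/content handling would happen here
elif usage := chunk.get("usage"):
    tokens_with_elif = usage.get("completion_tokens")

# if variant: the content delta and the usage summary are both processed.
if chunk.get("choices"):
    pass  # delta/content handling would happen here
if usage := chunk.get("usage"):
    tokens_with_if = usage.get("completion_tokens")

print(tokens_with_elif, tokens_with_if)  # None 42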