
feat: add perf utils #122

Merged — viraatc merged 6 commits into main from feat/viraatc-perf-utils on Feb 14, 2026

Conversation

@viraatc (Collaborator) commented Feb 10, 2026

What does this PR do?

addresses: #9

A performance-testing utility for the HTTP client. It benchmarks the send/receive rate of HTTPEndpointClient using uvloop, and can either auto-launch a MaxThroughputServer or connect to an external endpoint.

Usage (see all available args in --help):
    python -m inference_endpoint.utils.benchmark_httpclient -w 8 -c 512 -d 20
    python -m inference_endpoint.utils.benchmark_httpclient --endpoint http://host:8080/v1/chat/completions
    python -m inference_endpoint.utils.benchmark_httpclient --no-pin --track-memory

Sweep modes (-w, -c, -l accept ranges; endpoints always included):
    -w 4:12           every int in [4, 12]
    -c 100:500:100    start:stop:step  -> [100, 200, 300, 400, 500]
    -w 1:32::12       start:stop::N    -> 12 evenly-spaced points in [1, 32]
    -l 32,128,512     explicit values
    -w 1:32::12 -c 100:500::4          cartesian product sweep
    --full                             preset sweep of common worker counts x prompt lengths (non-streaming)
    --full --stream                    preset sweep of common worker counts x prompt lengths (streaming)
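The range grammar above can be sketched as a small parser. This is a hypothetical `parse_sweep_spec` helper written from the documented examples, not the PR's actual implementation:

```python
def parse_sweep_spec(spec: str) -> list[int]:
    """Parse a sweep spec: 'a', 'a:b', 'a:b:step', 'a:b::n', or 'v1,v2,...'."""
    if "," in spec:                         # explicit values: 32,128,512
        return [int(v) for v in spec.split(",")]
    parts = spec.split(":")
    if len(parts) == 1:                     # single value
        return [int(parts[0])]
    if len(parts) == 2:                     # a:b -> every int in [a, b]
        a, b = int(parts[0]), int(parts[1])
        return list(range(a, b + 1))
    if len(parts) == 3:                     # a:b:step
        a, b, step = (int(p) for p in parts)
        return list(range(a, b + 1, step))
    if len(parts) == 4 and parts[2] == "":  # a:b::n -> n evenly spaced ints
        a, b, n = int(parts[0]), int(parts[1]), int(parts[3])
        if n == 1:
            return [a]
        # round to ints and dedupe, keeping the endpoints
        return sorted({round(a + i * (b - a) / (n - 1)) for i in range(n)})
    raise ValueError(f"bad sweep spec: {spec!r}")
```

For example, `parse_sweep_spec("100:500:100")` yields `[100, 200, 300, 400, 500]`, matching the `-c 100:500:100` line above; a cartesian-product sweep is then just `itertools.product` over the parsed lists.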

example:

$ -> python scripts/benchmark_httpclient.py --duration 10 --full
================================================================================================
Sweep Summary: num_workers, prompt_length
================================================================================================
   num_workers |  prompt_length |    Send Rate |    Recv Rate | Outstanding |  Stall% |   Errors
------------------------------------------------------------------------------------------------
             1 |              1 |       80,703 |       49,278 |           0 |   62.3% |     0.0%
             1 |             32 |       80,038 |       48,142 |           0 |   61.5% |     0.0%
             1 |            128 |       79,333 |       48,868 |           0 |   62.0% |     0.0%
...
            16 |         16,384 |      103,547 |      103,541 |           0 |    0.0% |     0.0%
            16 |         32,768 |       73,652 |       73,636 |           0 |    0.0% |     0.0%
            16 |         65,536 |       26,006 |       26,006 |           0 |    0.0% |     0.0%
            16 |        131,072 |       19,434 |       19,433 |           0 |    0.0% |     0.0%


Plot saved to: /tmp/sweep_num_workers_1-16_x_prompt_length_1-131072_duration=3.0_max_in_flight=100000_pin=True.png
[screenshot: sweep plot, 2026-02-10]
$ -> python scripts/benchmark_httpclient.py -w 4:98:4 --stream --server-workers 16
...
==============================================================================================
Sweep Summary: num_workers
==============================================================================================
   num_workers |    Send Rate |    Recv Rate |   SSE-pkts/s | Outstanding |  Stall% |   Errors
----------------------------------------------------------------------------------------------
             4 |       14,787 |        4,805 |    4,814,196 |           0 |   86.2% |     0.0%
             8 |       19,579 |        9,480 |    9,499,254 |           0 |   81.9% |     0.0%
            12 |       24,152 |       13,855 |   13,882,346 |           0 |   78.9% |     0.0%
            16 |       27,946 |       17,755 |   17,790,948 |           0 |   73.8% |     0.0%
            20 |       31,415 |       21,201 |   21,243,302 |           0 |   67.9% |     0.0%
            24 |       33,869 |       23,596 |   23,643,404 |           0 |   61.6% |     0.0%
            28 |       36,177 |       26,064 |   26,116,147 |           0 |   56.2% |     0.0%
            32 |       38,805 |       28,520 |   28,576,789 |           0 |   48.8% |     0.0%
            36 |       41,288 |       30,458 |   30,518,869 |           0 |   42.9% |     0.0%
            40 |       43,885 |       32,100 |   32,164,009 |           0 |   36.0% |     0.0%
            44 |       45,934 |       35,565 |   35,635,725 |           0 |   27.0% |     0.0%
            48 |       46,878 |       36,733 |   36,806,714 |           0 |   22.8% |     0.0%
            52 |       47,495 |       35,641 |   35,711,910 |           0 |   19.9% |     0.0%
            56 |       49,991 |       37,925 |   38,001,183 |           0 |   14.1% |     0.0%
            60 |       53,353 |       40,325 |   40,405,908 |           0 |    9.0% |     0.0%
            64 |       56,786 |       43,011 |   43,097,440 |           0 |    5.6% |     0.0%
            68 |       60,130 |       45,769 |   45,860,070 |           0 |    2.4% |     0.0%
            72 |       63,623 |       49,915 |   50,015,115 |           0 |    0.5% |     0.0%
            76 |       61,537 |       51,891 |   51,994,283 |           0 |    0.0% |     0.0%
            80 |       59,533 |       54,540 |   54,648,673 |           0 |    0.0% |     0.0%
            84 |       60,139 |       55,623 |   55,734,724 |           0 |    0.0% |     0.0%
            88 |       60,035 |       59,991 |   60,111,393 |           0 |    0.0% |     0.0%
            92 |       61,096 |       61,075 |   61,196,969 |           0 |    0.0% |     0.0%
            96 |       61,690 |       61,674 |   61,797,769 |           0 |    0.0% |     0.0%
           100 |       62,063 |       62,055 |   62,179,067 |           0 |    0.0% |     0.0%
           104 |       62,567 |       62,559 |   62,683,778 |           0 |    0.0% |     0.0%
           108 |       63,729 |       63,720 |   63,847,133 |           0 |    0.0% |     0.0%
           112 |       62,865 |       50,857 |   50,959,197 |           0 |    0.0% |     0.0%
           116 |       63,247 |       50,351 |   50,451,365 |           0 |    0.0% |     0.0%
           120 |       63,367 |       52,499 |   52,603,988 |           0 |    0.0% |     0.0%
           124 |       62,059 |       52,932 |   53,037,533 |           0 |    0.0% |     0.0%
           128 |       61,529 |       51,434 |   51,536,902 |           0 |    0.0% |     0.0%
==============================================================================================

Plot saved to: /tmp/sweep_num_workers_4-128_duration=10.0_max_concurrency=100000_streaming=True.png
[screenshot: streaming sweep plot, 2026-02-13]

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@viraatc viraatc requested a review from a team as a code owner February 10, 2026 08:38
Copilot AI review requested due to automatic review settings February 10, 2026 08:38
@gemini-code-assist

Summary of Changes

Hello @viraatc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a robust set of performance utilities designed to thoroughly benchmark the HTTP client. It provides a dedicated script for running single performance tests or complex parameter sweeps, complete with live statistics, memory tracking, and CPU affinity controls. A new mock server is also included, allowing for isolated and high-throughput client-side performance measurements, with results automatically visualized through generated plots.

Highlights

  • New Performance Testing Utility: Introduced a new Python script (scripts/benchmark_httpclient.py) for comprehensive performance testing of the HTTP client, supporting both single runs and advanced parameter sweeps.
  • Max Throughput Mock Server: Developed a MaxThroughputServer (src/inference_endpoint/testing/max_throughput_server.py) to provide a high-throughput, low-latency mock OpenAI-compatible API server, enabling isolated client performance benchmarking.
  • Advanced Benchmarking Features: Implemented features within the benchmark utility for live statistics display, memory usage tracking, CPU affinity pinning for workers, and automatic plotting of sweep results using matplotlib.
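The CPU-affinity pinning mentioned above can be sketched as follows. This is a hypothetical `pin_to_cpu` helper, not the PR's code; note that `os.sched_setaffinity` is Linux-only, hence the feature guard:

```python
import os

def pin_to_cpu(cpu_index: int) -> bool:
    """Pin the calling process to a single CPU core, if the platform allows.

    Returns True on success, False when affinity control is unavailable
    (e.g. macOS) or the requested core cannot be used.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {cpu_index})  # pid 0 = the current process
        return True
    except OSError:
        return False

# Each benchmark worker i could call pin_to_cpu(i % os.cpu_count()) before
# entering its send/recv loop, so workers don't migrate between cores
# mid-measurement and the sweep numbers stay comparable across runs.
```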


Changelog
  • pyproject.toml
    • Added matplotlib>=3.8.0 to the performance dependency group.
  • scripts/benchmark_httpclient.py
    • Added a new script to benchmark the HTTP client's send/receive rates.
    • Implemented argument parsing for various benchmark configurations, including endpoint, duration, worker counts, connections, prompt length, streaming, and memory tracking.
    • Included logic for single benchmark runs and parameter sweeps (cartesian product).
    • Developed LiveDisplay for real-time statistics and memory usage monitoring.
    • Integrated matplotlib for generating and saving plots of sweep results.
    • Added CPU affinity pinning support for workers.
    • Configured uvloop for high-performance I/O.
  • src/inference_endpoint/testing/max_throughput_server.py
    • Added a new module defining MaxThroughputServer, a minimal OpenAI-compatible LLM API server.
    • Implemented build_streaming_response and build_non_streaming_response to generate fixed, pre-compiled HTTP responses.
    • Included RequestParser and ServerProtocol for efficient request handling.
    • Designed the server to run with multiple worker processes using multiprocessing for high throughput.
    • Provided options for live statistics display and auto-assigned port binding.
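The "fixed, pre-compiled HTTP responses" described above amount to serializing the full HTTP reply once at startup and replaying the same bytes for every request. A sketch of the non-streaming case (illustrative body and header set, not the PR's code):

```python
import json

def build_non_streaming_response(model: str = "stub-model") -> bytes:
    """Serialize a fixed OpenAI-style chat completion once, at startup."""
    body = json.dumps({
        "id": "chatcmpl-0",
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "ok"},
            "finish_reason": "stop",
        }],
    }).encode()
    headers = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/json\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"Connection: keep-alive\r\n"
        b"\r\n"
    )
    return headers + body

# The protocol handler can then answer each request with a single
# transport.write(PREBUILT) and no per-request JSON encoding, which is
# what makes the stub server a roofline for client-side throughput.
PREBUILT = build_non_streaming_response()
```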
Activity
  • No human activity (comments, reviews, progress updates) was provided in the context for this pull request.


Copilot AI left a comment


Pull request overview

Adds a high-throughput stub server and a benchmarking script to measure HTTP client send/recv throughput (including sweep + plotting), plus a dependency update to support plotting.

Changes:

  • Introduces MaxThroughputServer: a minimal OpenAI-compatible server returning pre-built responses for roofline-style client benchmarking.
  • Adds scripts/benchmark_httpclient.py with single-run + sweep modes, live stats, optional memory tracking, and plot generation.
  • Adds matplotlib to dependencies to support sweep plotting.
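The streaming path exercised in the sweeps above (the SSE-pkts/s column) pairs naturally with a pre-compiled chunked SSE reply. A sketch, with a hypothetical helper and illustrative payloads rather than the PR's actual `build_streaming_response`:

```python
import json

def build_streaming_response_sketch(n_chunks: int = 3) -> bytes:
    """Pre-compile a fixed SSE chat-completion stream as a chunked HTTP/1.1 reply."""

    def chunk(payload: bytes) -> bytes:
        # HTTP/1.1 chunked transfer encoding: <hex length>\r\n<data>\r\n
        return f"{len(payload):x}".encode() + b"\r\n" + payload + b"\r\n"

    events = []
    for i in range(n_chunks):
        delta = json.dumps(
            {"choices": [{"index": 0, "delta": {"content": f"tok{i}"}}]}
        )
        events.append(f"data: {delta}\n\n".encode())
    events.append(b"data: [DONE]\n\n")  # OpenAI-style stream terminator

    head = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/event-stream\r\n"
        b"Transfer-Encoding: chunked\r\n"
        b"\r\n"
    )
    # "0\r\n\r\n" is the zero-length chunk that ends a chunked body.
    return head + b"".join(chunk(e) for e in events) + b"0\r\n\r\n"
```

Because the byte string is fixed, the client's SSE-packet rate measures only its own parsing and event-loop overhead, not server-side generation.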

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

  • src/inference_endpoint/testing/max_throughput_server.py — New minimal high-throughput HTTP stub server for isolating client throughput.
  • scripts/benchmark_httpclient.py — New benchmark utility with sweep modes, live stats, restartable local server, and plotting.
  • pyproject.toml — Adds matplotlib dependency (currently under test extras) for plot output.



@gemini-code-assist (bot) left a comment


Code Review

The pull request introduces a new performance testing utility for HTTP clients, along with a mock maximum throughput server. The utility supports single runs and parameter sweeps, including CPU affinity pinning and memory tracking. It also adds matplotlib as a dependency for plotting sweep results. The overall structure and functionality appear sound, providing a comprehensive tool for benchmarking. There are a couple of areas where maintainability and efficiency could be improved.

Copilot AI review requested due to automatic review settings February 10, 2026 08:51

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings February 10, 2026 08:56

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.



Copilot AI review requested due to automatic review settings February 10, 2026 09:08
@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from 88dc43d to d451922 Compare February 10, 2026 09:10

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.




github-actions bot commented Feb 10, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@arekay-nv (Collaborator) left a comment


Thanks for this - definitely useful. Would love to try it out.

Copilot AI review requested due to automatic review settings February 10, 2026 22:13

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.



Copilot AI review requested due to automatic review settings February 10, 2026 23:04

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/inference_endpoint/testing/max_throughput_server.py:1

  • The _restart_server function accesses private attributes of the MaxThroughputServer class, creating tight coupling. Consider adding a public restart or reconfigure method to the server class instead.

Copilot AI review requested due to automatic review settings February 10, 2026 23:11

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.



@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from e77e711 to 8bbb5e6 Compare February 10, 2026 23:36
Copilot AI review requested due to automatic review settings February 10, 2026 23:37
@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from 8bbb5e6 to db397d0 Compare February 10, 2026 23:37

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.



@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from db397d0 to 2cf3680 Compare February 11, 2026 00:05
Copilot AI review requested due to automatic review settings February 11, 2026 00:17
@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from 2cf3680 to e15536b Compare February 11, 2026 00:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.



@arekay-nv (Collaborator) left a comment


Awesome. Thanks!

Copilot AI review requested due to automatic review settings February 13, 2026 00:47

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.



Copilot AI review requested due to automatic review settings February 14, 2026 00:06

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from 65efeb2 to 3465924 Compare February 14, 2026 00:13
Copilot AI review requested due to automatic review settings February 14, 2026 00:21

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@viraatc viraatc merged commit 24272d0 into main Feb 14, 2026
10 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 14, 2026
@viraatc viraatc deleted the feat/viraatc-perf-utils branch February 14, 2026 00:51