
Max RPS in sweep mode could be higher #93

Open
Feature

Description

@dagrayvid

Currently, in sweep mode, we perform N tests that are linearly spaced between the RPS values measured during a sequential test and a throughput test. However, in certain cases (e.g., output sequences longer than 1500 tokens, or very short test durations), throughput mode may underestimate the maximum throughput the server can handle.
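For concreteness, here is a minimal sketch of how those sweep points are spaced; the bound values and variable names are hypothetical, not taken from the project's code:

```python
import numpy as np

# Hypothetical RPS bounds measured by the two preliminary tests.
sequential_rps = 0.5   # lower bound: sequential (one request at a time) test
throughput_rps = 8.0   # upper bound: throughput (overload) test
n_tests = 6            # number of sweep tests to run

# Sweep mode runs one test at each linearly spaced RPS value.
sweep_rps = np.linspace(sequential_rps, throughput_rps, n_tests)
print(sweep_rps)  # [0.5 2.  3.5 5.  6.5 8. ]
```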

In throughput mode, we can think of the model server as a black box that receives an overload of requests and, after an initial delay (the time to finish the first request), completes requests at some roughly constant rate; the goal is to determine that steady output rate. Currently, we calculate the average RPS over the entire test duration. As the test duration increases, this average RPS approaches the server's maximum sustained output rate. However, for shorter test durations or long output sequences, the initial delay before the server reaches its steady state takes up a significant fraction of the total test time, so the average RPS comes out much lower than the true maximum rate the server can achieve once the "pipeline is full."
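To make the effect concrete, here is a small sketch with made-up numbers: a server that takes 60 s to complete its first request and then completes one request per second thereafter. Over a 300 s test, the whole-duration average sits well below the steady-state rate:

```python
# Made-up numbers, purely for illustration.
initial_delay_s = 60.0    # time until the first request completes
steady_rate_rps = 1.0     # completions per second once the pipeline is full
test_duration_s = 300.0

# One request at t=60s, then one per second for the remaining 240s.
completed = 1 + (test_duration_s - initial_delay_s) * steady_rate_rps  # 241

average_rps = completed / test_duration_s                                 # ~0.80
steady_state_rps = (completed - 1) / (test_duration_s - initial_delay_s)  # 1.00

print(f"average over full test: {average_rps:.2f} RPS")
print(f"steady-state rate:      {steady_state_rps:.2f} RPS")
```

With shorter tests or longer initial delays, the gap only widens.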

I propose that we calculate the upper bound for the RPS sweep using the time window from when the first request completes to when the last request completes. Specifically, the RPS should be calculated as:

```python
RPS = (len(requests) - 1) / (max(requests["end_time"]) - min(requests["end_time"]))
```
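A self-contained version of that calculation, assuming `requests` is a pandas DataFrame with an `end_time` column of completion timestamps in seconds (the indexing above suggests something DataFrame-like, but the exact type is an assumption on my part):

```python
import pandas as pd

def steady_state_rps(requests: pd.DataFrame) -> float:
    """Proposed upper bound for the sweep: completions per second measured
    from the first completion to the last, skipping the initial delay."""
    window_s = requests["end_time"].max() - requests["end_time"].min()
    return (len(requests) - 1) / window_s

# Toy data: first request finishes at t=60s, then one per second after that.
requests = pd.DataFrame({"end_time": [60.0 + i for i in range(241)]})
print(steady_state_rps(requests))  # 1.0, vs. 241/300 ≈ 0.80 over the full test
```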

I think this approach would more accurately reflect the server's true maximum throughput in terms of RPS. What do you think?
