
Max RPS in sweep mode could be higher #93

Open
Feature

Description

@dagrayvid

Currently, in sweep mode, we perform N tests that are linearly spaced between the RPS values measured during a sequential test and a throughput test. However, in certain cases (e.g., output sequences longer than 1500 tokens, or very short test durations), throughput mode may underestimate the maximum throughput the server can handle.
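For concreteness, here is a minimal sketch of how those sweep points are spaced; the bound values and variable names are hypothetical, not taken from the project's code:

```python
import numpy as np

# Hypothetical RPS bounds measured by the two preliminary tests.
sequential_rps = 0.5   # lower bound: sequential (one request at a time) test
throughput_rps = 8.0   # upper bound: throughput (overload) test
n_tests = 6            # number of sweep tests to run

# Sweep mode runs one test at each linearly spaced RPS value.
sweep_rps = np.linspace(sequential_rps, throughput_rps, n_tests)
print(sweep_rps)  # [0.5 2.  3.5 5.  6.5 8. ]
```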

In throughput mode, we can think of the model server as a black box that receives an overload of requests and, after an initial delay (the time to finish the first request), completes requests at some roughly constant rate; the goal is to determine that steady output rate. Currently, we calculate the average RPS over the entire test duration. As the test duration increases, this average RPS approaches the server's maximum sustained output rate. However, for shorter test durations or long output sequences, the initial delay before the server reaches its steady state takes up a significant fraction of the total test time, so the average RPS comes out much lower than the true maximum rate the server can achieve once the "pipeline is full."
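To make the effect concrete, here is a small sketch with made-up numbers: a server that takes 60 s to complete its first request and then completes one request per second thereafter. Over a 300 s test, the whole-duration average sits well below the steady-state rate:

```python
# Made-up numbers, purely for illustration.
initial_delay_s = 60.0    # time until the first request completes
steady_rate_rps = 1.0     # completions per second once the pipeline is full
test_duration_s = 300.0

# One request at t=60s, then one per second for the remaining 240s.
completed = 1 + (test_duration_s - initial_delay_s) * steady_rate_rps  # 241

average_rps = completed / test_duration_s                                 # ~0.80
steady_state_rps = (completed - 1) / (test_duration_s - initial_delay_s)  # 1.00

print(f"average over full test: {average_rps:.2f} RPS")
print(f"steady-state rate:      {steady_state_rps:.2f} RPS")
```

With shorter tests or longer initial delays, the gap only widens.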

I propose that we calculate the upper bound for the RPS sweep using the time window from when the first request completes to when the last request completes. Specifically, the RPS should be calculated as:

```python
RPS = (len(requests) - 1) / (max(requests["end_time"]) - min(requests["end_time"]))
```
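A self-contained version of that calculation, assuming `requests` is a pandas DataFrame with an `end_time` column of completion timestamps in seconds (the indexing above suggests something DataFrame-like, but the exact type is an assumption on my part):

```python
import pandas as pd

def steady_state_rps(requests: pd.DataFrame) -> float:
    """Proposed upper bound for the sweep: completions per second measured
    from the first completion to the last, skipping the initial delay."""
    window_s = requests["end_time"].max() - requests["end_time"].min()
    return (len(requests) - 1) / window_s

# Toy data: first request finishes at t=60s, then one per second after that.
requests = pd.DataFrame({"end_time": [60.0 + i for i in range(241)]})
print(steady_state_rps(requests))  # 1.0, vs. 241/300 ≈ 0.80 over the full test
```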

I think this approach would more accurately reflect the server's true maximum throughput in terms of RPS. What do you think?
