Closed
Description
When running a GuideLLM sweep, the token throughput reported for each RPS test is lower than a user would expect given vLLM's generation throughput metric.
I think there are two main reasons for this.
- Server is under-utilized at the beginning of the run while the number of requests in flight is ramping up
- GuideLLM undercounts the tokens generated towards the end of the test duration, because cancelled requests are not counted (see #77: Record partially completed request metrics)
I propose that we add an additional performance metric; we could call it something along the lines of "peak token generation throughput", hopefully in fewer words.
I have a couple of ideas on how this could be calculated:
- Calculate token throughput in a sliding window of 30s or 1 minute, using decode times across all ongoing requests in that window, and report the max.
- Calculate token throughput over the interval from the time the first completed request finished to the time the last completed request started. This doesn't perfectly eliminate the under-utilization at the beginning and end of the test, but it gets pretty close.
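As a rough sketch of the first option (the function names and the `(start, end, output_tokens)` request-record shape here are assumptions for illustration, not GuideLLM internals): attribute each request's output tokens uniformly over its decode interval, then slide a fixed window across the run and take the max tokens/sec.

```python
def tokens_in_window(requests, w_start, w_end):
    """Sum tokens attributed to [w_start, w_end], assuming each request's
    output tokens are spread uniformly over its (start, end) interval."""
    total = 0.0
    for start, end, num_tokens in requests:
        overlap = max(0.0, min(end, w_end) - max(start, w_start))
        duration = end - start
        if duration > 0:
            total += num_tokens * overlap / duration
    return total


def peak_throughput(requests, window=30.0, step=1.0):
    """Max tokens/sec over any sliding window of `window` seconds.

    requests: list of (start_time, end_time, num_output_tokens) tuples.
    """
    t0 = min(start for start, _, _ in requests)
    t1 = max(end for _, end, _ in requests)
    # If the whole run is shorter than the window, just use the full span.
    if t1 - t0 <= window:
        return tokens_in_window(requests, t0, t1) / (t1 - t0)
    best = 0.0
    t = t0
    while t + window <= t1:
        best = max(best, tokens_in_window(requests, t, t + window) / window)
        t += step
    return best
```

The uniform-attribution assumption is what lets this count partial contributions from requests that are cancelled mid-window, which also sidesteps part of the undercounting problem from #77.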
WDYT?