
Conversation


@hnts03-moreh hnts03-moreh commented Dec 4, 2025

Purpose

Support benchmark trimming with a user-defined time interval to obtain accurate decode-related metrics.

vLLM's current benchmark does not support trimming, which is needed for precise decode-related metrics. At the beginning of a benchmark run, the vLLM inference server cannot fully utilize its decode capacity because prefill prevents it from processing large decode batches immediately, so the server needs some time to reach a fully loaded decoding state. Similarly, at the end of the run, server utilization drops as requests finish gradually. The benchmark cannot distinguish this low-load data from the stable, high-load data, which skews the resulting decode metrics.

We have added --warmup-time and --cooldown-time options to configure the effective measurement interval.

The effective time interval is computed as (effective interval) = (E - c) - (S + w), where S and E are the benchmark start and end times, w is the warmup time, and c is the cooldown time.
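
To make the trimming concrete, here is a minimal sketch of the windowing logic. It is illustrative only, assuming per-token completion timestamps are recorded; the function names and data layout are hypothetical and not the PR's actual code.

```python
# Minimal sketch of the trimming idea; names are illustrative, not the PR's actual code.
def effective_window(start_s: float, end_s: float,
                     warmup_s: float, cooldown_s: float) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the effective measurement interval."""
    return start_s + warmup_s, end_s - cooldown_s


def trim_token_timestamps(token_timestamps: list[float],
                          lower: float, upper: float) -> list[float]:
    """Keep only tokens generated inside the effective window."""
    return [t for t in token_timestamps if lower <= t <= upper]
```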

Test Plan

#!/bin/bash
vllm bench serve \
    --backend vllm \
    --model "deepseek-ai/DeepSeek-R1" \
    --metric-percentiles "10,25,50,75,90" \
    --percentile-metrics "itl,tps,ttft,e2el" \
    --port 8001 \
    --num-prompts 4096 \
    --max-concurrency 2048 \
    --ignore-eos \
    --ready-check-timeout-sec 0 \
    --dataset-name sharegpt \
    --dataset-path  <sharegpt dataset path> \
    --sharegpt-input-len 1000 \
    --sharegpt-output-len 1000 \
    --warmup-time 15 \
    --cooldown-time 15 \
    --request-rate 50

Test Result

============ Serving Benchmark Result ============
Successful requests:                     4096      
Failed requests:                         0         
Maximum request concurrency:             2048      
Request rate configured (RPS):           50.00     
Benchmark duration (s):                  363.04    
Total input tokens:                      4096000   
Total generated tokens:                  4096000   
Request throughput (req/s):              11.28     
Output token throughput (tok/s):         11282.58  
Peak output token throughput (tok/s):    22144.00  
Peak concurrent requests:                3072.00   
Total Token throughput (tok/s):          22565.16  
---------------Time to First Token----------------
Mean TTFT (ms):                          18562.13  
Median TTFT (ms):                        12916.93  
P10 TTFT (ms):                           3958.21   
P25 TTFT (ms):                           8711.41   
P50 TTFT (ms):                           12916.93  
P75 TTFT (ms):                           28272.30  
P90 TTFT (ms):                           41581.57  
---------------Inter-token Latency----------------
Mean ITL (ms):                           152.68    
Median ITL (ms):                         106.13    
P10 ITL (ms):                            85.20     
P25 ITL (ms):                            97.22     
P50 ITL (ms):                            106.13    
P75 ITL (ms):                            114.54    
P90 ITL (ms):                            132.27    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          171013.53 
Median E2EL (ms):                        179964.43 
P10 E2EL (ms):                           146520.16 
P25 E2EL (ms):                           159509.39 
P50 E2EL (ms):                           179964.43 
P75 E2EL (ms):                           182864.51 
P90 E2EL (ms):                           183505.13 
==================================================
tip: install termplotlib and gnuplot to plot the metrics
Serving Benchmark Result after warmup before cooldown
Warm-up Time:                            15.0      
Cool-down Time:                          15.0      
Total counted tokens at filtering:       3856547   
Benchmark duration (s):                  333.01    
Total generated tokens:                  3856547   
Output token throughput (tok/s):         11580.97  
---------------Inter-token Latency----------------
Mean ITL (ms):                           153.76    
Median ITL (ms):                         106.28    
P10 ITL (ms):                            85.20     
P25 ITL (ms):                            97.36     
P50 ITL (ms):                            106.28    
P75 ITL (ms):                            114.68    
P90 ITL (ms):                            132.17    
==================================================

The trimmed result is appended at the bottom of the standard benchmark output. It reports the following (a sketch of the computation follows this list):

  1. the trimmed duration (effective time interval)
  2. the number of generated tokens inside that interval
  3. the output token throughput over the interval
  4. the ITL statistics
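
As a rough illustration (assuming per-request lists of token completion timestamps; the names below are assumptions, not the benchmark's actual data structures), the trimmed metrics could be aggregated like this:

```python
# Illustrative aggregation over the trimmed window; names are assumptions,
# not the benchmark's actual code.
import numpy as np


def trimmed_summary(per_request_token_times: list[list[float]],
                    lower: float, upper: float) -> dict:
    """Aggregate decode metrics from tokens that fall inside [lower, upper]."""
    itls_ms: list[float] = []
    counted_tokens = 0
    for times in per_request_token_times:
        kept = [t for t in times if lower <= t <= upper]
        counted_tokens += len(kept)
        # Inter-token latencies between consecutive kept tokens, in milliseconds.
        itls_ms.extend((b - a) * 1000.0 for a, b in zip(kept, kept[1:]))
    duration_s = upper - lower
    return {
        "duration_s": duration_s,
        "total_generated_tokens": counted_tokens,
        "output_token_throughput": counted_tokens / duration_s,
        "mean_itl_ms": float(np.mean(itls_ms)) if itls_ms else 0.0,
        "p90_itl_ms": float(np.percentile(itls_ms, 90)) if itls_ms else 0.0,
    }
```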

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@hnts03-moreh hnts03-moreh self-assigned this Dec 4, 2025

github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀
