
Conversation


@hnts03-moreh hnts03-moreh commented Dec 4, 2025

Purpose

Support benchmark trimming with a user-defined time interval to obtain accurate decode-related metrics.

vLLM's current benchmark does not support trimming, which is needed for precise decode-related metrics. At the beginning of a benchmark run, the vLLM inference server cannot fully utilize its decode capacity because prefill prevents it from processing large decode batches immediately, so the server needs some time to reach a fully loaded decoding state. Similarly, at the end of the run, server utilization drops as requests finish gradually. The benchmark cannot distinguish this low-load data from the stable, high-load data, which skews the resulting decode metrics.

We have added --warmup-time and --cooldown-time options to configure the effective measurement interval.

The effective time interval is computed as (effective interval) = (E - c) - (S + w), where S and E are the benchmark start and end times, w is the warmup time, and c is the cooldown time.
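
To make the trimming concrete, here is a minimal sketch of the windowing logic. It is illustrative only, assuming per-token completion timestamps are recorded; the function names and data layout are hypothetical and not the PR's actual code.

```python
# Minimal sketch of the trimming idea; names are illustrative, not the PR's actual code.
def effective_window(start_s: float, end_s: float,
                     warmup_s: float, cooldown_s: float) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the effective measurement interval."""
    return start_s + warmup_s, end_s - cooldown_s


def trim_token_timestamps(token_timestamps: list[float],
                          lower: float, upper: float) -> list[float]:
    """Keep only tokens generated inside the effective window."""
    return [t for t in token_timestamps if lower <= t <= upper]
```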

Test Plan

#!/bin/bash
vllm bench serve \
    --backend vllm \
    --model "deepseek-ai/DeepSeek-R1" \
    --metric-percentiles "10,25,50,75,90" \
    --percentile-metrics "itl,tps,ttft,e2el" \
    --port 8001 \
    --num-prompts 4096 \
    --max-concurrency 2048 \
    --ignore-eos \
    --ready-check-timeout-sec 0 \
    --dataset-name sharegpt \
    --dataset-path  <sharegpt dataset path> \
    --sharegpt-input-len 1000 \
    --sharegpt-output-len 1000 \
    --warmup-time 15 \
    --cooldown-time 15 \
    --request-rate 50

Test Result

============ Serving Benchmark Result ============
Successful requests:                     4096      
Failed requests:                         0         
Maximum request concurrency:             2048      
Request rate configured (RPS):           50.00     
Benchmark duration (s):                  363.04    
Total input tokens:                      4096000   
Total generated tokens:                  4096000   
Request throughput (req/s):              11.28     
Output token throughput (tok/s):         11282.58  
Peak output token throughput (tok/s):    22144.00  
Peak concurrent requests:                3072.00   
Total Token throughput (tok/s):          22565.16  
---------------Time to First Token----------------
Mean TTFT (ms):                          18562.13  
Median TTFT (ms):                        12916.93  
P10 TTFT (ms):                           3958.21   
P25 TTFT (ms):                           8711.41   
P50 TTFT (ms):                           12916.93  
P75 TTFT (ms):                           28272.30  
P90 TTFT (ms):                           41581.57  
---------------Inter-token Latency----------------
Mean ITL (ms):                           152.68    
Median ITL (ms):                         106.13    
P10 ITL (ms):                            85.20     
P25 ITL (ms):                            97.22     
P50 ITL (ms):                            106.13    
P75 ITL (ms):                            114.54    
P90 ITL (ms):                            132.27    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          171013.53 
Median E2EL (ms):                        179964.43 
P10 E2EL (ms):                           146520.16 
P25 E2EL (ms):                           159509.39 
P50 E2EL (ms):                           179964.43 
P75 E2EL (ms):                           182864.51 
P90 E2EL (ms):                           183505.13 
==================================================
tip: install termplotlib and gnuplot to plot the metrics
Serving Benchmark Result after warmup before cooldown
Warm-up Time:                            15.0      
Cool-down Time:                          15.0      
Total counted tokens at filtering:       3856547   
Benchmark duration (s):                  333.01    
Total generated tokens:                  3856547   
Output token throughput (tok/s):         11580.97  
---------------Inter-token Latency----------------
Mean ITL (ms):                           153.76    
Median ITL (ms):                         106.28    
P10 ITL (ms):                            85.20     
P25 ITL (ms):                            97.36     
P50 ITL (ms):                            106.28    
P75 ITL (ms):                            114.68    
P90 ITL (ms):                            132.17    
==================================================

The trimmed result is appended at the bottom of the standard benchmark output. It reports the following (a sketch of the computation follows this list):

  1. the trimmed duration (effective time interval)
  2. the number of generated tokens inside that interval
  3. the output token throughput over the interval
  4. the ITL statistics
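
As a rough illustration (assuming per-request lists of token completion timestamps; the names below are assumptions, not the benchmark's actual data structures), the trimmed metrics could be aggregated like this:

```python
# Illustrative aggregation over the trimmed window; names are assumptions,
# not the benchmark's actual code.
import numpy as np


def trimmed_summary(per_request_token_times: list[list[float]],
                    lower: float, upper: float) -> dict:
    """Aggregate decode metrics from tokens that fall inside [lower, upper]."""
    itls_ms: list[float] = []
    counted_tokens = 0
    for times in per_request_token_times:
        kept = [t for t in times if lower <= t <= upper]
        counted_tokens += len(kept)
        # Inter-token latencies between consecutive kept tokens, in milliseconds.
        itls_ms.extend((b - a) * 1000.0 for a, b in zip(kept, kept[1:]))
    duration_s = upper - lower
    return {
        "duration_s": duration_s,
        "total_generated_tokens": counted_tokens,
        "output_token_throughput": counted_tokens / duration_s,
        "mean_itl_ms": float(np.mean(itls_ms)) if itls_ms else 0.0,
        "p90_itl_ms": float(np.percentile(itls_ms, 90)) if itls_ms else 0.0,
    }
```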

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@hnts03-moreh hnts03-moreh self-assigned this Dec 4, 2025

github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀
