Background
Recently, we have been working on optimizing the position computation for multimodal models in vLLM.
During benchmarking, we noticed that our results were not as expected.
To investigate, we decided to reproduce the benchmark results from PR #25337, comparing the performance before and after that PR was merged into the main branch.
However, our reproduced results differ significantly from the performance data reported in the PR.
We’d like to understand whether this discrepancy may be caused by hardware differences, model choice, or benchmark setup.
Could anyone help point us in the right direction?
Model and Environment
- Model used: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 (the Qwen3-VL-4B model used in the PR could not be found on Hugging Face)
- GPU: NVIDIA A100 PCIe
- vLLM startup command:
vllm serve "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384
Benchmark Command
vllm bench serve \
  --backend openai-chat \
  --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
  --base-url "http://localhost:8000" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "hf" \
  --dataset-path "lmarena-ai/VisionArena-Chat" \
  --num-prompts 100 \
  --request-rate 10 \
  --save-result \
  --result-dir benchmarks_results \
  --result-filename test.json
Our Benchmark Results
Before PR #25337
============ Serving Benchmark Result ============
Successful requests:                     100
Request rate configured (RPS):           10.00
Benchmark duration (s):                  16.91
Total input tokens:                      5280
Total generated tokens:                  11522
Request throughput (req/s):              5.91
Output token throughput (tok/s):         681.42
Peak output token throughput (tok/s):    2225.00
Peak concurrent requests:                97.00
Total Token throughput (tok/s):          993.68
---------------Time to First Token----------------
Mean TTFT (ms):                          1176.13
Median TTFT (ms):                        1185.79
P99 TTFT (ms):                           2178.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.39
Median TPOT (ms):                        78.68
P99 TPOT (ms):                           392.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           77.30
Median ITL (ms):                         42.31
P99 ITL (ms):                            581.15
==================================================
After PR #25337
============ Serving Benchmark Result ============
Successful requests:                     100
Request rate configured (RPS):           10.00
Benchmark duration (s):                  16.89
Total input tokens:                      5280
Total generated tokens:                  11640
Request throughput (req/s):              5.92
Output token throughput (tok/s):         689.02
Peak output token throughput (tok/s):    2178.00
Peak concurrent requests:                97.00
Total Token throughput (tok/s):          1001.57
---------------Time to First Token----------------
Mean TTFT (ms):                          1193.52
Median TTFT (ms):                        1285.23
P99 TTFT (ms):                           2111.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.84
Median TPOT (ms):                        78.00
P99 TPOT (ms):                           344.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.89
Median ITL (ms):                         42.30
P99 ITL (ms):                            597.42
==================================================
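Since both runs were launched with --save-result, the two summary files can also be compared programmatically. The sketch below shows one way to print a side-by-side diff; the file names are hypothetical (the command above writes benchmarks_results/test.json, so the two runs would need to be saved under different names), and the JSON field names such as mean_ttft_ms are assumptions based on the vLLM bench serve summary and may differ across vLLM versions.
import json
from pathlib import Path

# Hypothetical file names: each run's result saved under a distinct name.
BEFORE = Path("benchmarks_results/test_before_pr25337.json")
AFTER = Path("benchmarks_results/test_after_pr25337.json")

# Field names assumed from the vLLM `bench serve` summary; keys that are
# not present in a given vLLM version's result file are simply skipped.
METRICS = [
    "request_throughput",
    "output_throughput",
    "total_token_throughput",
    "mean_ttft_ms",
    "median_ttft_ms",
    "p99_ttft_ms",
    "mean_tpot_ms",
    "median_tpot_ms",
    "p99_tpot_ms",
]

before = json.loads(BEFORE.read_text())
after = json.loads(AFTER.read_text())

print(f"{'metric':<24}{'before':>12}{'after':>12}{'delta %':>10}")
for key in METRICS:
    if key not in before or key not in after:
        continue  # key missing in this result schema
    b, a = float(before[key]), float(after[key])
    delta = (a - b) / b * 100 if b else float("nan")
    print(f"{key:<24}{b:>12.2f}{a:>12.2f}{delta:>9.1f}%")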
Reference: Benchmark Results from PR #25337
Main branch
============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================
This branch
============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================
Question
The results we obtained are noticeably different from the benchmark numbers shown in PR #25337.
Could this gap be explained by differences such as:
- Model: Qwen3-VL-4B vs. Qwen3-VL-30B-A3B-Instruct-FP8
- Hardware: A100 PCIe vs. SXM
- Dataset or benchmarking parameters
Has anyone else tried reproducing this PR and observed similar discrepancies?
Thanks in advance for any help or clarification!
