Background
Recently, we have been working on optimizing the position computation for multimodal models in vLLM.
During benchmarking, we noticed that our results were not as expected.
To investigate, we decided to reproduce the benchmark results from PR #25337, comparing the performance before and after that PR was merged into the main branch.
However, our reproduced results differ significantly from the performance data reported in the PR.
We’d like to understand whether this discrepancy may be caused by hardware differences, model choice, or benchmark setup.
Could anyone help point us in the right direction?
Model and Environment
- Model used: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 (the Qwen3-VL-4B model used in the PR could not be found on Hugging Face)
- GPU: NVIDIA A100 PCIe
- vLLM startup command:
vllm serve "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384
Benchmark Command
vllm bench serve \
  --backend openai-chat \
  --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" \
  --base-url "http://localhost:8000" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "hf" \
  --dataset-path "lmarena-ai/VisionArena-Chat" \
  --num-prompts 100 \
  --request-rate 10 \
  --save-result \
  --result-dir benchmarks_results \
  --result-filename test.json
Our Benchmark Results
Before PR #25337
============ Serving Benchmark Result ============
Successful requests:                     100
Request rate configured (RPS):           10.00
Benchmark duration (s):                  16.91
Total input tokens:                      5280
Total generated tokens:                  11522
Request throughput (req/s):              5.91
Output token throughput (tok/s):         681.42
Peak output token throughput (tok/s):    2225.00
Peak concurrent requests:                97.00
Total Token throughput (tok/s):          993.68
---------------Time to First Token----------------
Mean TTFT (ms):                          1176.13
Median TTFT (ms):                        1185.79
P99 TTFT (ms):                           2178.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.39
Median TPOT (ms):                        78.68
P99 TPOT (ms):                           392.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           77.30
Median ITL (ms):                         42.31
P99 ITL (ms):                            581.15
==================================================
After PR #25337
============ Serving Benchmark Result ============
Successful requests:                     100
Request rate configured (RPS):           10.00
Benchmark duration (s):                  16.89
Total input tokens:                      5280
Total generated tokens:                  11640
Request throughput (req/s):              5.92
Output token throughput (tok/s):         689.02
Peak output token throughput (tok/s):    2178.00
Peak concurrent requests:                97.00
Total Token throughput (tok/s):          1001.57
---------------Time to First Token----------------
Mean TTFT (ms):                          1193.52
Median TTFT (ms):                        1285.23
P99 TTFT (ms):                           2111.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.84
Median TPOT (ms):                        78.00
P99 TPOT (ms):                           344.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.89
Median ITL (ms):                         42.30
P99 ITL (ms):                            597.42
==================================================
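Since both runs were launched with --save-result, the two summary files can also be compared programmatically. The sketch below shows one way to print a side-by-side diff; the file names are hypothetical (the command above writes benchmarks_results/test.json, so the two runs would need to be saved under different names), and the JSON field names such as mean_ttft_ms are assumptions based on the vLLM bench serve summary and may differ across vLLM versions.
import json
from pathlib import Path

# Hypothetical file names: each run's result saved under a distinct name.
BEFORE = Path("benchmarks_results/test_before_pr25337.json")
AFTER = Path("benchmarks_results/test_after_pr25337.json")

# Field names assumed from the vLLM `bench serve` summary; keys that are
# not present in a given vLLM version's result file are simply skipped.
METRICS = [
    "request_throughput",
    "output_throughput",
    "total_token_throughput",
    "mean_ttft_ms",
    "median_ttft_ms",
    "p99_ttft_ms",
    "mean_tpot_ms",
    "median_tpot_ms",
    "p99_tpot_ms",
]

before = json.loads(BEFORE.read_text())
after = json.loads(AFTER.read_text())

print(f"{'metric':<24}{'before':>12}{'after':>12}{'delta %':>10}")
for key in METRICS:
    if key not in before or key not in after:
        continue  # key missing in this result schema
    b, a = float(before[key]), float(after[key])
    delta = (a - b) / b * 100 if b else float("nan")
    print(f"{key:<24}{b:>12.2f}{a:>12.2f}{delta:>9.1f}%")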
Reference: Benchmark Results from PR #25337
Main branch
============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================
This branch
============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================
Question
The results we obtained are noticeably different from the benchmark numbers shown in PR #25337.
Could this gap be explained by differences such as:
- Model: Qwen3-VL-4B vs. Qwen3-VL-30B-A3B-Instruct-FP8
- Hardware: A100 PCIe vs. SXM
- Dataset or benchmarking parameters
Has anyone else tried reproducing this PR and observed similar discrepancies?
Thanks in advance for any help or clarification!
