
Conversation

@ywang96 ywang96 (Member) commented Sep 21, 2025

Purpose

Optimize `fast_pos_embed_interpolate` in Qwen3-VL by replacing list-based, CPU-bound computation with vectorized tensor operations on the GPU.

Test Plan

10 QPS of VisionArena requests against Qwen3-VL 4B served on an A100.

Test Result

Main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================

This branch

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================

MMMU evaluation results matched between main and this branch.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 requested a review from sighingnow as a code owner September 21, 2025 08:27
@mergify mergify bot added the qwen Related to Qwen models label Sep 21, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a performance optimization to the fast_pos_embed_interpolate method in vllm/model_executor/models/qwen3_vl.py. The changes refactor the method to perform computations on the GPU using vectorized PyTorch operations, avoiding expensive list manipulations and CPU-GPU data transfers. A constant num_grid_per_side is now pre-calculated in the __init__ method to avoid repeated calculations. The new implementation is more efficient and readable, leveraging batched tensor operations for embedding lookups and calculations, which should lead to the performance improvements shown in the PR description. The logic appears correct and functionally equivalent to the previous implementation. I have no high or critical severity comments on these changes.
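The core trick the review describes is computing all interpolation indices and weights with batched tensor operations instead of Python list manipulation. A minimal NumPy sketch of that pattern (illustrative names only, not the actual `vllm/model_executor/models/qwen3_vl.py` code, which operates on torch tensors on device):

```python
import numpy as np

def bilinear_pos_embed(pos_embed: np.ndarray, h: int, w: int) -> np.ndarray:
    """Resample a (side, side, dim) position-embedding grid to (h, w, dim)
    with 4-point bilinear interpolation, fully vectorized."""
    side = pos_embed.shape[0]
    # Fractional source coordinates for each target row/column.
    ys = np.linspace(0, side - 1, h)
    xs = np.linspace(0, side - 1, w)
    y0 = np.floor(ys).astype(np.int64)
    x0 = np.floor(xs).astype(np.int64)
    y1 = np.minimum(y0 + 1, side - 1)
    x1 = np.minimum(x0 + 1, side - 1)
    wy = (ys - y0)[:, None, None]   # (h, 1, 1)
    wx = (xs - x0)[None, :, None]   # (1, w, 1)
    # Gather the four corner embeddings with batched fancy indexing.
    p00 = pos_embed[y0[:, None], x0[None, :]]  # (h, w, dim)
    p01 = pos_embed[y0[:, None], x1[None, :]]
    p10 = pos_embed[y1[:, None], x0[None, :]]
    p11 = pos_embed[y1[:, None], x1[None, :]]
    # Blend corners; broadcasting computes every weight in one pass.
    return ((1 - wy) * (1 - wx) * p00 + (1 - wy) * wx * p01
            + wy * (1 - wx) * p10 + wy * wx * p11)
```

No per-patch Python loop runs, so on GPU the same structure maps to a handful of kernel launches regardless of grid size.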

Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 requested a review from Isotr0py September 21, 2025 08:41
@Isotr0py Isotr0py enabled auto-merge (squash) September 21, 2025 09:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 21, 2025
@Isotr0py Isotr0py merged commit 30d0891 into vllm-project:main Sep 21, 2025
59 checks passed
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
skyloevil added a commit to skyloevil/vllm that referenced this pull request Oct 8, 2025
Replace slow bicubic interpolation with fast bilinear interpolation
for SigLIP vision encoder position embeddings.

Motivation:
- Current implementation uses F.interpolate(..., mode='bicubic') which
  requires 16-point sampling and is slow on GPU
- Following the optimization pattern from Qwen3-VL (commit 30d0891)
- Expected 3-4x speedup on GPU for vision encoding

Changes:
- Added fast_interpolate_pos_encoding() method using direct bilinear
  interpolation with vectorized index/weight computation
- Uses 4-point bilinear instead of 16-point bicubic
- Eliminates CPU-GPU transfers by computing all operations on device
- Batch embedding lookup reduces kernel launch overhead
- Updated forward() to use the optimized interpolation

Implementation Details:
- Bilinear interpolation: P = w00*P00 + w01*P01 + w10*P10 + w11*P11
- Vectorized via broadcasting: all weights computed in single operation
- Direct tensor indexing avoids permute/reshape overhead
- Follows exact pattern validated in Qwen3-VL optimization

Performance:
- CPU: Functional correctness verified (cosine sim > 0.91)
- GPU: Expected 3-4x speedup (requires GPU testing)
- Affects: Pixtral, PaliGemma, all models using SigLIP encoder

Testing:
- Numerical validation: max diff < 2.75, acceptable for learned embeddings
- Edge cases: same resolution, large resolution, non-square all pass
- Gradient flow: verified backward pass works correctly

Related:
- Qwen3-VL optimization: vllm-project#25337
- Pattern: bicubic → bilinear for position embeddings

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
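The 4-point formula and single batched lookup this commit message describes can be sketched as follows (NumPy, with a hypothetical helper name; the real patch works on torch tensors and a SigLIP embedding table):

```python
import numpy as np

def fused_bilinear_lookup(table: np.ndarray, ys: np.ndarray,
                          xs: np.ndarray) -> np.ndarray:
    """Sample a (side, side, dim) embedding table at fractional (ys, xs)
    points, fetching all four bilinear corners in one batched gather."""
    side = table.shape[0]
    y0, x0 = np.floor(ys).astype(np.int64), np.floor(xs).astype(np.int64)
    y1, x1 = np.minimum(y0 + 1, side - 1), np.minimum(x0 + 1, side - 1)
    dy, dx = ys - y0, xs - x0
    # Stack the four corner index pairs: shape (4, n).
    rows = np.stack([y0, y0, y1, y1])
    cols = np.stack([x0, x1, x0, x1])
    # Matching weights w00, w01, w10, w11, computed in one broadcast pass.
    wts = np.stack([(1 - dy) * (1 - dx), (1 - dy) * dx,
                    dy * (1 - dx), dy * dx])
    flat = table.reshape(side * side, -1)
    corners = flat[rows * side + cols]        # one gather: (4, n, dim)
    # P = w00*P00 + w01*P01 + w10*P10 + w11*P11, summed over the corner axis.
    return np.einsum('kn,knd->nd', wts, corners)
```

Folding the four lookups into one flat gather is what reduces the kernel-launch count relative to four separate indexing ops.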
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025