
Conversation

@ywang96 ywang96 (Member) commented Sep 21, 2025

Purpose

Optimize `fast_pos_embed_interpolate` in Qwen3-VL by replacing list-based, CPU-bound computation with vectorized tensor operations on the GPU.

Test Plan

10 QPS of VisionArena requests against Qwen3-VL 4B served on an A100.

Test Result

Main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.85    
Total input tokens:                      94327     
Total generated tokens:                  120882    
Request throughput (req/s):              9.82      
Output token throughput (tok/s):         1186.81   
Peak output token throughput (tok/s):    2862.00   
Peak concurrent requests:                133.00    
Total Token throughput (tok/s):          2112.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          229.53    
Median TTFT (ms):                        180.19    
P99 TTFT (ms):                           928.83    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.65     
Median TPOT (ms):                        36.29     
P99 TPOT (ms):                           87.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         17.36     
P99 ITL (ms):                            186.27    
==================================================

This branch

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  101.66    
Total input tokens:                      94327     
Total generated tokens:                  120735    
Request throughput (req/s):              9.84      
Output token throughput (tok/s):         1187.67   
Peak output token throughput (tok/s):    2310.00   
Peak concurrent requests:                124.00    
Total Token throughput (tok/s):          2115.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          203.78    
Median TTFT (ms):                        162.26    
P99 TTFT (ms):                           848.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.27     
Median TPOT (ms):                        31.53     
P99 TPOT (ms):                           80.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.00     
Median ITL (ms):                         16.07     
P99 ITL (ms):                            170.49    
==================================================

MMMU evaluation results matched between main and this branch.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 requested a review from sighingnow as a code owner September 21, 2025 08:27
@mergify mergify bot added the qwen Related to Qwen models label Sep 21, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a performance optimization to the fast_pos_embed_interpolate method in vllm/model_executor/models/qwen3_vl.py. The changes refactor the method to perform computations on the GPU using vectorized PyTorch operations, avoiding expensive list manipulations and CPU-GPU data transfers. A constant num_grid_per_side is now pre-calculated in the __init__ method to avoid repeated calculations. The new implementation is more efficient and readable, leveraging batched tensor operations for embedding lookups and calculations, which should lead to the performance improvements shown in the PR description. The logic appears correct and functionally equivalent to the previous implementation. I have no high or critical severity comments on these changes.
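The core trick the review describes is computing all interpolation indices and weights with batched tensor operations instead of Python list manipulation. A minimal NumPy sketch of that pattern (illustrative names only, not the actual `vllm/model_executor/models/qwen3_vl.py` code, which operates on torch tensors on device):

```python
import numpy as np

def bilinear_pos_embed(pos_embed: np.ndarray, h: int, w: int) -> np.ndarray:
    """Resample a (side, side, dim) position-embedding grid to (h, w, dim)
    with 4-point bilinear interpolation, fully vectorized."""
    side = pos_embed.shape[0]
    # Fractional source coordinates for each target row/column.
    ys = np.linspace(0, side - 1, h)
    xs = np.linspace(0, side - 1, w)
    y0 = np.floor(ys).astype(np.int64)
    x0 = np.floor(xs).astype(np.int64)
    y1 = np.minimum(y0 + 1, side - 1)
    x1 = np.minimum(x0 + 1, side - 1)
    wy = (ys - y0)[:, None, None]   # (h, 1, 1)
    wx = (xs - x0)[None, :, None]   # (1, w, 1)
    # Gather the four corner embeddings with batched fancy indexing.
    p00 = pos_embed[y0[:, None], x0[None, :]]  # (h, w, dim)
    p01 = pos_embed[y0[:, None], x1[None, :]]
    p10 = pos_embed[y1[:, None], x0[None, :]]
    p11 = pos_embed[y1[:, None], x1[None, :]]
    # Blend corners; broadcasting computes every weight in one pass.
    return ((1 - wy) * (1 - wx) * p00 + (1 - wy) * wx * p01
            + wy * (1 - wx) * p10 + wy * wx * p11)
```

No per-patch Python loop runs, so on GPU the same structure maps to a handful of kernel launches regardless of grid size.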

Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 requested a review from Isotr0py September 21, 2025 08:41
@Isotr0py Isotr0py enabled auto-merge (squash) September 21, 2025 09:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 21, 2025
@Isotr0py Isotr0py merged commit 30d0891 into vllm-project:main Sep 21, 2025
59 checks passed
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
skyloevil added a commit to skyloevil/vllm that referenced this pull request Oct 8, 2025
Replace slow bicubic interpolation with fast bilinear interpolation
for SigLIP vision encoder position embeddings.

Motivation:
- Current implementation uses F.interpolate(..., mode='bicubic') which
  requires 16-point sampling and is slow on GPU
- Following the optimization pattern from Qwen3-VL (commit 30d0891)
- Expected 3-4x speedup on GPU for vision encoding

Changes:
- Added fast_interpolate_pos_encoding() method using direct bilinear
  interpolation with vectorized index/weight computation
- Uses 4-point bilinear instead of 16-point bicubic
- Eliminates CPU-GPU transfers by computing all operations on device
- Batch embedding lookup reduces kernel launch overhead
- Updated forward() to use the optimized interpolation

Implementation Details:
- Bilinear interpolation: P = w00*P00 + w01*P01 + w10*P10 + w11*P11
- Vectorized via broadcasting: all weights computed in single operation
- Direct tensor indexing avoids permute/reshape overhead
- Follows exact pattern validated in Qwen3-VL optimization

Performance:
- CPU: Functional correctness verified (cosine sim > 0.91)
- GPU: Expected 3-4x speedup (requires GPU testing)
- Affects: Pixtral, PaliGemma, all models using SigLIP encoder

Testing:
- Numerical validation: max diff < 2.75, acceptable for learned embeddings
- Edge cases: same resolution, large resolution, non-square all pass
- Gradient flow: verified backward pass works correctly

Related:
- Qwen3-VL optimization: vllm-project#25337
- Pattern: bicubic → bilinear for position embeddings

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
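The 4-point formula and single batched lookup this commit message describes can be sketched as follows (NumPy, with a hypothetical helper name; the real patch works on torch tensors and a SigLIP embedding table):

```python
import numpy as np

def fused_bilinear_lookup(table: np.ndarray, ys: np.ndarray,
                          xs: np.ndarray) -> np.ndarray:
    """Sample a (side, side, dim) embedding table at fractional (ys, xs)
    points, fetching all four bilinear corners in one batched gather."""
    side = table.shape[0]
    y0, x0 = np.floor(ys).astype(np.int64), np.floor(xs).astype(np.int64)
    y1, x1 = np.minimum(y0 + 1, side - 1), np.minimum(x0 + 1, side - 1)
    dy, dx = ys - y0, xs - x0
    # Stack the four corner index pairs: shape (4, n).
    rows = np.stack([y0, y0, y1, y1])
    cols = np.stack([x0, x1, x0, x1])
    # Matching weights w00, w01, w10, w11, computed in one broadcast pass.
    wts = np.stack([(1 - dy) * (1 - dx), (1 - dy) * dx,
                    dy * (1 - dx), dy * dx])
    flat = table.reshape(side * side, -1)
    corners = flat[rows * side + cols]        # one gather: (4, n, dim)
    # P = w00*P00 + w01*P01 + w10*P10 + w11*P11, summed over the corner axis.
    return np.einsum('kn,knd->nd', wts, corners)
```

Folding the four lookups into one flat gather is what reduces the kernel-launch count relative to four separate indexing ops.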
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025