[Perf]: Optimize qwen2-vl to reduce cudaMemcpyAsync #14377
Conversation
Thanks for this optimization! Can you please update qwen2.5-vl as well?
Branch updated from ae09649 to 1fbb69c. Head branch was pushed to by a user without write access.
@cynthieye Thank you for making this PR! Can you update this branch with our main branch? I think the CI error should have been fixed on main a while ago.
Left a few comments - otherwise LGTM!
@@ -259,6 +259,8 @@ def forward(
    x: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rotary_pos_emb: torch.Tensor,
    max_seqlen: int = None,
Shouldn't max_seqlen also be Optional[int]?
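A minimal sketch of what that would look like (the parameter list follows the diff above; the self parameter, return type, and comment are assumptions):

from typing import Optional

import torch

def forward(
    self,
    x: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rotary_pos_emb: torch.Tensor,
    # Annotating as Optional[int] matches the default value of None.
    max_seqlen: Optional[int] = None,
) -> torch.Tensor:
    ...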
    max_seqlen: int,
    seqlens: list[int],
Please modify the typing accordingly
    max_seqlen: int = None,
    seqlens: Optional[list[int]] = None,
ditto
    max_seqlen: int,
    seqlens: list[int],
ditto
    max_seqlen: int,
    seqlens: list[int],
I think it's probably a good idea to add a bit of documentation here to indicate that max_seqlen is only used for FA and seqlens is only used for xformers.
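For example, a short docstring along these lines would cover it (a sketch only; the exact wording and summary line are assumptions, and the parameter list follows the diffs above):

from typing import Optional

import torch

def forward(
    self,
    x: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rotary_pos_emb: torch.Tensor,
    max_seqlen: Optional[int] = None,
    seqlens: Optional[list[int]] = None,
) -> torch.Tensor:
    """Run attention over the packed visual sequence.

    Args:
        max_seqlen: Maximum sequence length, only consumed by the
            flash-attn backend.
        seqlens: Per-sequence lengths, only consumed by the xformers
            backend.
    """
    ...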
Qwen2-VL logic optimization: during each forward pass, the xformers branch of Qwen2VisionTransformer calls the tensor tolist method multiple times (the flash-attn branch calls tensor item multiple times), which forces the GPU tensor to be copied to the CPU and triggers cudaMemcpyAsync, adding latency. Since these calls produce the same result every time for the same input, the conversion is now executed once and the remaining calls reuse the first result. After this optimization, QPS in our online environment improves by 15% on the xformers branch and by 7% on the flash-attn branch.
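A minimal sketch of the idea, assuming cu_seqlens is a cumulative-lengths tensor on the GPU; the helper name and the loop below are illustrative, not the PR's exact code:

import torch

def _seq_metadata(cu_seqlens: torch.Tensor) -> tuple[int, list[int]]:
    # Perform the GPU -> CPU copy exactly once per forward pass.
    seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()  # used by the xformers branch
    max_seqlen = max(seqlens)                               # used by the flash-attn branch
    return max_seqlen, seqlens

# The vision transformer then threads the precomputed values through every
# attention block instead of calling .tolist()/.item() per layer:
#
#   max_seqlen, seqlens = _seq_metadata(cu_seqlens)
#   for blk in self.blocks:
#       x = blk(x, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb,
#               max_seqlen=max_seqlen, seqlens=seqlens)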