Qwen2vl vision encoder fix #2365
base: main
Conversation
Thanks for investigating this!
@@ -30,10 +30,12 @@
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, repeat
from vllm.config import CacheConfig, MultiModalConfig
remove unused imports
from vllm.distributed import parallel_state
from vllm.distributed import utils as dist_utils
from vllm.logger import init_logger
from vllm.model_executor.layers.activation import QuickGELU
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
remove unused imports
q = q.squeeze(0)
k = k.squeeze(0)
v = v.squeeze(0)
output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
I think pytorch SDPA should also be fast; the problem is probably how you prepare the attention mask.
Can you vectorize the code more (fewer Python for-loops), write a triton kernel for it (see example), or cache the result so we can reuse it across layers?
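For reference, a minimal sketch of one way to build the block-diagonal vision attention mask without a Python for-loop; the `cu_seqlens` naming follows the usual varlen-attention convention and is illustrative, not taken from this diff:

```python
import torch

def build_block_diag_mask(cu_seqlens: torch.Tensor, seq_len: int) -> torch.Tensor:
    # Map each patch index to the image it belongs to by bucketing into the
    # cumulative sequence lengths (cu_seqlens = [0, len_0, len_0 + len_1, ...]).
    idx = torch.arange(seq_len, device=cu_seqlens.device)
    seq_ids = torch.bucketize(idx, cu_seqlens[1:].to(idx.dtype), right=True)
    # Patches may attend to each other only when they belong to the same image.
    # SDPA accepts this boolean mask directly (True = attend).
    return seq_ids.unsqueeze(0) == seq_ids.unsqueeze(1)
```

Since the mask depends only on `cu_seqlens`, it could be computed once per batch and reused across all encoder layers.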
Hmm, I have tried caching the attention mask, but it doesn't seem to impact performance much.
The issue I see is that for a single layer, the context_attention_fwd kernel in sglang matches torch's scaled_dot_product_attention pretty closely, within 1e-2 for each activation. But qwen2vl has 32 layers, and the absolute difference accumulates, reaching roughly ±1.0 max absolute difference in the activations.
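One way to see where the drift comes from is to capture the per-layer activations from both implementations on the same input (e.g. via forward hooks) and print the running maximum absolute difference. A rough sketch, with illustrative names and the capture mechanism left out:

```python
import torch

def report_layerwise_diff(ref_acts, test_acts):
    """Compare two lists of per-layer activation tensors, layer by layer."""
    for i, (ref, test) in enumerate(zip(ref_acts, test_acts)):
        diff = (ref.float() - test.float()).abs()
        print(f"layer {i:2d}: max abs diff {diff.max().item():.4e}, "
              f"mean abs diff {diff.mean().item():.4e}")
```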
""" | ||
|
||
def forward(self, input: torch.Tensor) -> torch.Tensor: | ||
return input * torch.sigmoid(1.702 * input) |
try torch.compile to fuse them?
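As a rough illustration of that suggestion (not the PR's code), wrapping the activation module in torch.compile lets the elementwise multiply and sigmoid be fused into a single kernel:

```python
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    # Same x * sigmoid(1.702 * x) approximation as in the diff above.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)

# torch.compile traces the module on first call and can fuse the elementwise ops.
act = torch.compile(QuickGELU())
y = act(torch.randn(16, 1280))
```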
QWEN2_VL_MODEL = "Qwen/Qwen2-VL-7B-Instruct"


class RawSGLangTest(unittest.IsolatedAsyncioTestCase):
This looks good. Maybe give it a better name.
Now we have a very good reference script for text-only models and a very good model support guide:
https://github.com/sgl-project/sglang/blob/main/scripts/playground/reference_hf.py
https://sgl-project.github.io/references/supported_models.html#how-to-support-a-new-model
Are you willing to help here to add some scripts/docs similar to the above ones, but for vision language models?
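For context, a VLM analogue of `reference_hf.py` might look roughly like the sketch below, which runs Qwen2-VL through plain transformers to produce reference outputs. Class and processor names follow the public Hugging Face API; the image path and prompt are placeholders:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL)

# Placeholder image and prompt; replace with the real test inputs.
image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```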
@jakep-allenai Do you have any updates on this? Qwen2vl is a very popular model so we would like to fix it soon.
No, my implementation with F.scaled_dot_product_attention was still roughly half the speed even after caching the attention mask, and I never heard back from @Mr-Loevan about rerunning his benchmark to see whether it would fix his reported issue. On our side, we found no significant difference in user preference between generations from vllm (which used the xformers backend) and from sglang. My current theory is that the memory-efficient attention implementation in sglang is accurate enough for a single layer, but small errors accumulate across a typical 30+ layer network.
Potential fix for #2112
Motivation
Users have reported worse results running Qwen2-VL in sglang and vllm than with transformers. I have identified a few places where the vision encoder's computation differs from the transformers implementation; with these changes, the outputs should match.
This is currently a draft PR because inference speed is roughly halved.
Modifications
Checklist