Support qwen2 vl #2689

Merged: 18 commits from support-qwen2-vl into main on Oct 30, 2024
Conversation

drbh (Collaborator) commented on Oct 24, 2024

This is a work-in-progress PR to support qwen2-vl. Currently these changes include loading the model weights and a functioning vision model. The remaining work is to adjust the existing qwen2 model to handle multimodal requests and positional embeddings.

status:

  • load weights
  • support prefill warmup run
  • accept chat image request and process image
  • align vision model output with reference impl
  • correctly merge the processed image and text model
  • avoid unnecessary reshapes and allocations at runtime where possible

remaining:

  • resolve remaining bug with position ids
  • align test output with reference
  • cleanup remaining todos/refactors/improvements

further:

  • make improvements
  • improve test coverage

@@ -144,7 +144,7 @@ def load_qkv(
            num_key_value_heads=num_key_value_heads,
        )
        if bias:
-            raise NotImplementedError("packed_qkv only implemented for baichuan")
+            bias = weights.get_tensor(f"{prefix}.bias")
Collaborator (reviewer):
This is wrong, no?

We get the whole bias for Row Parallel; for Column Parallel you need to take the actual slice, and for qkv I think you need to follow the same layout as the weights (except on dim=0 instead of dim=1).
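For illustration, here is a minimal sketch of what "taking the actual slice" means for a column-parallel bias. The helper name shard_qkv_bias and the rank/world_size arguments are made up for this example, and it ignores the fused q/k/v packing that the real layout also has to respect; it is not TGI's actual API:

import torch

def shard_qkv_bias(full_bias: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # A column-parallel linear splits its output features across ranks, so the
    # 1-D bias has to be sliced along dim=0 with the same per-rank layout as
    # the weight rows; returning the full bias on every rank would be wrong.
    out_features = full_bias.shape[0]
    assert out_features % world_size == 0
    block = out_features // world_size
    return full_bias[rank * block : (rank + 1) * block]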

drbh (author):

After comparing with transformers, it seems that weights.get_tensor(f"{prefix}.bias") and weights.get_sharded(f"{prefix}.qkv.bias", dim=0) return exactly the same bias as the one in the reference.

I've reverted the change within tensor_parallel.py::load_qkv in favor of setting the bias after creating the linear in qwen_vl.py via weights.get_sharded(f"{prefix}.qkv.bias", dim=0).

for reference:

self.qkv = TensorParallelColumnLinear.load_qkv(
    config,
    prefix=f"{prefix}.qkv",
    weights=weights,
    bias=False,
    num_heads=self.num_heads,
    num_key_value_heads=self.num_heads,
)
self.qkv.linear.bias = weights.get_sharded(f"{prefix}.qkv.bias", dim=0)

This hopefully makes the qkv loading a bit clearer.

-        self.rotary_emb(query, torch.select(kv, dim=1, index=0), cos, sin)
+        # TODO: correctly handle the multimodal case
+        if False:
+            self.rotary_emb(query, torch.select(kv, dim=1, index=0), cos, sin)
Collaborator (reviewer):

This should be correct. Ignore tiny differences there; this code is exactly what you have underneath (and much more efficient).

I read the transformers comment on this, and from what I'm reading they are just applying part of the tensors there, so regular slicing should do the work.

The problem with the other part is that our cos, sin are laid out differently than theirs, so you're going to have issues keeping the transformers code and merging it with our own.
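For context, a rough sketch of what "applying part of the tensors" via regular slicing looks like; rotary_dim, the tensor shapes, and the cos/sin layout are assumptions for illustration, not the rotary kernel TGI actually uses:

import torch

def apply_partial_rotary(query, key, cos, sin, rotary_dim):
    # Rotate only the first `rotary_dim` channels and pass the rest through,
    # assuming cos/sin are already broadcastable to [..., rotary_dim].
    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    q_rot, q_pass = query[..., :rotary_dim], query[..., rotary_dim:]
    k_rot, k_pass = key[..., :rotary_dim], key[..., rotary_dim:]
    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    k_rot = k_rot * cos + rotate_half(k_rot) * sin
    return torch.cat((q_rot, q_pass), dim=-1), torch.cat((k_rot, k_pass), dim=-1)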

hidden_states = self.embed_tokens(input_ids)

# if inputs_embeds are supplied from an external model (vision model) then avoid embedding input_ids
if inputs_embeds is not None:
Collaborator (reviewer):

No, remove this.

The way it's done here is that we lift input_ids to the parent class, and this always takes inputs_embeds. It makes the signatures much cleaner. (It's already done this way for llama if you want to check.)
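A minimal sketch of that pattern (the class and argument names are made up for illustration; the real llama/qwen2 implementations differ in detail):

import torch.nn as nn

class MultimodalModelSketch(nn.Module):
    # Illustrative only: the wrapper owns embed_tokens, so the inner text
    # model's forward signature takes inputs_embeds and never needs an
    # `if inputs_embeds is not None` branch.
    def __init__(self, text_model, vision_model, embed_tokens):
        super().__init__()
        self.text_model = text_model
        self.vision_model = vision_model
        self.embed_tokens = embed_tokens

    def forward(self, input_ids, pixel_values=None, **kwargs):
        inputs_embeds = self.embed_tokens(input_ids)
        if pixel_values is not None:
            # vision features get merged into inputs_embeds here,
            # outside the text model
            ...
        return self.text_model(inputs_embeds=inputs_embeds, **kwargs)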

drbh (author):

Yes, agreed, this is much cleaner. I've updated the classes in the latest commit to follow the same pattern as llama.

@@ -306,12 +335,24 @@ def forward(
        max_s: int,
        true_max_s: int,
        prefill_cache_indices: Optional[torch.Tensor],
+       inputs_embeds: Optional[torch.Tensor] = None,
+       attention_mask: Optional[torch.Tensor] = None,
Collaborator (reviewer):

No attention_mask

drbh (author):

removed in latest commit

Comment on lines 405 to 415

            image_mask = (
                (input_ids == self.image_token_id)
                .unsqueeze(-1)
                .expand_as(inputs_embeds)
                .to(inputs_embeds.device)
            )
            image_embeds = image_embeds.to(
                inputs_embeds.device, inputs_embeds.dtype
            )
            # input embeddings are masked with image embeddings
            inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
Collaborator (reviewer):

Why doesn't

inputs_embeds[input_ids == self.image_token_id] = image_embeds

work?

drbh (author):

Ha, yes, that's a much better way to write this! I've included this change along with an overall rewrite of this logic in the latest commit.
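For anyone following along, a tiny self-contained check (toy shapes and made-up values) showing that the boolean-index assignment produces the same result as the masked_scatter version above:

import torch

image_token_id = 99
input_ids = torch.tensor([7, 99, 8, 99, 9])  # positions 1 and 3 are image tokens
inputs_embeds = torch.zeros(5, 4)
image_embeds = torch.arange(8, dtype=torch.float32).reshape(2, 4)

# masked_scatter version, as in the original diff
mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
scattered = inputs_embeds.masked_scatter(mask, image_embeds)

# boolean-index assignment suggested above
indexed = inputs_embeds.clone()
indexed[input_ids == image_token_id] = image_embeds

assert torch.equal(scattered, indexed)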

        attention_mask = torch.ones_like(
            input_ids, dtype=torch.bool, device=input_ids.device
        )
        inputs_embeds = self.text_model.embed_tokens(input_ids)
Collaborator (reviewer):

When you lift this, this will be gone from the text_model and be here directly.

drbh (author):

updated in latest commit

        image_index, video_index = 0, 0

        for i, input_ids in enumerate(total_input_ids):
            if attention_mask is not None:
Collaborator (reviewer):

This whole thing is extremely poor code (lots of loops, lots of CPU/GPU back&forth).

I think ditching it altogether will be easier than trying to adapt it.

drbh (author):

Yes, agreed. In the latest commit I've moved this logic into a get_position_ids function and rewritten it as a simpler version that avoids most of the GPU/CPU copies. I'll revisit later to see if I can simplify further (avoid the loop), but the changes should already provide a bit of a performance and readability improvement.
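For readers unfamiliar with the 3-D (temporal/height/width) position ids qwen2-vl uses, here is a deliberately simplified sketch of the idea. It assumes a single image whose placeholder tokens are contiguous and ignores spatial merging, so it is not the PR's actual get_position_ids:

import torch

def get_position_ids_sketch(input_ids, image_grid_thw, image_token_id):
    # Returns a [3, seq_len] tensor: text tokens share one running index on
    # all three axes, while image tokens get (t, h, w) grid coordinates.
    seq_len = input_ids.shape[0]
    positions = torch.arange(seq_len, device=input_ids.device).repeat(3, 1)

    image_mask = input_ids == image_token_id
    if image_mask.any():
        t, h, w = (int(x) for x in image_grid_thw[0])
        start = int(image_mask.nonzero()[0])
        t_idx = torch.arange(t).repeat_interleave(h * w)
        h_idx = torch.arange(h).repeat_interleave(w).repeat(t)
        w_idx = torch.arange(w).repeat(t * h)
        grid = torch.stack([t_idx, h_idx, w_idx]).to(input_ids.device) + start
        n = grid.shape[1]
        positions[:, start : start + n] = grid
        # text after the image continues after the largest index used so far
        positions[:, start + n :] += int(grid.max()) + 1 - (start + n)
    return positions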

@HuggingFaceDocBuilderDev:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

drbh force-pushed the support-qwen2-vl branch from d1bc32b to f2a1b1b on October 28, 2024
drbh marked this pull request as ready for review on October 29, 2024
drbh requested a review from Narsil on October 30, 2024
Comment on lines +373 to +374
position_ids = position_ids.repeat(3, 1, 1).clone()
batch.position_ids = position_ids[0, 0, :]
Collaborator (reviewer):

This seems very wrong, no ?

position_ids = self.model.get_position_ids(
    input_ids.unsqueeze(0), batch.image_grid_thw
)
batch.position_ids = position_ids[0, 0, :]
Collaborator (reviewer):

Why create so many position ids, just to discard them?

        )
        self.device = weights.device

    def get_position_ids(
Collaborator (reviewer):

Seems overly complex and bloated.

Let's keep this if it's working, but it's definitely fixable I think

if hasattr(self.model, "get_position_ids"):
    if position_ids.shape[0] != 1:
        position_ids = self.model.get_position_ids(
            input_ids.unsqueeze(0), batch.image_grid_thw
Collaborator (reviewer):

No unsqueeze, fix the function.

drbh (author) commented on Oct 30, 2024

Merging to enable qwen2-vl.

Will follow up with an improvement PR soon, specifically:

  • improve decode
  • improve vision head
  • improve batch to handle multi dimensional position ids
  • remove complex position logic if possible

drbh merged commit befd9f6 into main on Oct 30, 2024 (12 of 13 checks passed)
drbh deleted the support-qwen2-vl branch on October 30, 2024