[Model][VLM] Add Qwen2-VL model support #7905
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Add Qwen2-VL support in chat_utils.py.
…ties in a single batch.
Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly. In the meantime, can you fix the CI failures?
@fyabc Thank you for contributing to vLLM! I took a brief look and left a first round of review. Please take a look.

As @DarkLight1337 mentioned, we might want to wait for #7559 to be merged first: since we're going to have a model that supports a mix of modalities, we want to be careful with API changes.
```python
# special processing for mrope position deltas.
if self.runner.model_is_mrope:
    image_grid_thw = mm_kwargs.get("image_grid_thw", None)
    video_grid_thw = mm_kwargs.get("video_grid_thw", None)
    assert image_grid_thw is not None or video_grid_thw is not None, \
        "mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'."

    hf_config = self.runner.model_config.hf_config

    from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding

    inter_data.mrope_input_positions = [None] * inter_data.n_seqs
    for seq_idx in range(inter_data.n_seqs):
        seq_data = seq_group_metadata.seq_data[
            inter_data.seq_ids[seq_idx]]
        token_ids = seq_data.get_token_ids()

        mrope_input_positions, mrope_position_delta = MRotaryEmbedding.get_input_positions(
            token_ids,
            image_grid_thw=image_grid_thw,
            video_grid_thw=video_grid_thw,
            image_token_id=hf_config.image_token_id,
            video_token_id=hf_config.video_token_id,
            vision_start_token_id=hf_config.vision_start_token_id,
            vision_end_token_id=hf_config.vision_end_token_id,
            spatial_merge_size=hf_config.vision_config.spatial_merge_size,
            context_len=inter_data.context_lens[seq_idx],
        )

        seq_data.mrope_position_delta = mrope_position_delta
        inter_data.mrope_input_positions[seq_idx] = mrope_input_positions
```
I'm okay with us doing this at the model runner level, and I'm honestly not sure if there's a better place to apply mrope. What's your thought on this? @WoosukKwon
Can you merge from …
Hi @DarkLight1337 @ywang96, I have updated this PR based on your review comments; please check it again.
# Conflicts: # vllm/worker/model_runner.py
@fyabc Hi, can this patch support multiple images in one prompt, like the following:
Hi @DragonFive, you can pass multiple images into a single prompt like this:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]
```

See the "Multi image inference" section of our README for more details.
Try this: `"type": "image_url"`.
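For readers calling the OpenAI-compatible server, the `image_url` content type suggested above can carry several images in one request. The snippet below is a hedged sketch, not part of this PR: the server address, model name, and URLs are placeholders, and depending on the vLLM version the server may need to be launched with a higher per-prompt image limit (e.g. via `--limit-mm-per-prompt`, if available).

```python
from openai import OpenAI

# Placeholder server address and model name; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                # Multiple image_url entries in a single user turn.
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
                {"type": "text", "text": "Identify the similarities between these images."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```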
Hi, I am running this and got the same error as #8281 (`Unrecognized keys in …`). Could someone help me with this? Thank you!
See my comment above: #7905 (comment)
This version of transformers will raise the following error:
The current version of vLLM requires …
@fyabc Hi, I've noticed that in the Qwen2 VL chat template, there is no '\n' after <|vision_end|>, but there is one when launched through the vllm API server. This seems to be a bug. |
This transformers version is not compatible with the latest vLLM anymore (mllama missing). I tried using transformers after this fix (huggingface/transformers#33753), but vLLM is still throwing `assert "factor" in rope_scaling`.
Yeah, you need to install vLLM from source to fix the problem now. Please refer to the top post in this thread.
@chenzhengda Hi, by default all mm placeholders are joined with a "\n" separator (see …).
Thanks for pointing that out. We have this on our multimodality plan but haven't gotten around to implementing it yet. Since many HF chat templates do not specify how to combine placeholder multimodal tokens (like …)
Does Qwen2-VL deployed using vLLM support function calling?

+1
For those of you who want a temporary fix for this, here's how I do it (then reinstall vLLM). Also expect an official fix soon.
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Alvant <alvasian@yandex.ru>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Amit Garg <mitgarg17495@gmail.com>
EDIT: Nevermind, I just had a silly issue where …

Hey @fyabc, I am working on expanding quantization for multimodal models, and currently this special case in the Qwen2-VL weight loading is causing issues: vllm/vllm/model_executor/models/qwen2_vl.py, lines 1174 to 1190 in ab6f981.

Could you offer insight into why this is required and if we could apply a transformation on the inputs rather than the weights?
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>


This PR adds support for the Qwen2-VL model.
FIX #8139
FIX #8281
Requirements
- This PR requires `transformers` with this PR merged and this bugfix PR merged (you can install it via `pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830`). NOTE: The current latest `transformers` release has a bug, so you should install a development version as above for now.
- For `transformers>=4.45`, please install vLLM from source.
- For `transformers>=4.45`, please install `vllm>=0.6.3`.

Optional Requirements
- `qwen-vl-utils` to preprocess multimodal content correctly (`qwen-vl-utils` is not a part of this PR).

Example Usage
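The original example content is not reproduced here; as a rough sketch (not verbatim from this PR), offline inference could look like the following. The model name, prompt template, and image path are assumptions to be adjusted for your setup.

```python
from PIL import Image

from vllm import LLM, SamplingParams

# Assumed model name and chat prompt template for Qwen2-VL.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("/path/to/image.jpg")

outputs = llm.generate(
    {
        "prompt": prompt,
        # The image is passed alongside the prompt under the "image" modality key.
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```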
Notes
Here are some important notes about this PR:
- Qwen2-VL uses rotary embedding with multimodal sections (`mrope`); see `vllm/model_executor/layers/rotary_embedding.py` for more details. This rotary embedding requires the input `positions` to be a tensor of shape `(3, seq_len)` (instead of `(seq_len,)` in the common case). A hedged sketch of the decode-step positions follows this list.
  - This PR adds a `_mrope_position_delta` attribute (of type `Optional[int]`) to `vllm.sequence.SequenceData`; this attribute is used to compute `mrope_input_positions` at each decoding step. (If reviewers have a better solution, please comment in this PR.)
  - This PR updates `model_runner.py` to compute the `mrope_input_positions` when the model uses `mrope`. Other model runners should also follow this logic; I think this can be done in another PR (I will add this part if reviewers think it needs to be implemented in this PR).
- Qwen2-VL uses `flash-attn==2.6.1` (instead of `vllm-flash-attn==2.6.1`) to compute vision attention (see the commented line 36 in `vllm/model_executor/models/qwen2_vl.py`). The current `vllm-flash-attn` version outputs `NaN` logits values, and I am still debugging this bug.
  - This PR adds an `xformers` backend as a fallback implementation of `Qwen2VisionAttention`, so there is no need to add `flash-attn` to the project requirements file.
- Qwen2-VL supports both image and video inputs. To support this feature, we add a `video` multimodal plugin (see `vllm/multimodal/video.py` for more details).
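To illustrate the first note: once `mrope_position_delta` has been stored for a sequence, the positions for later decode steps can be derived from it, since no new vision content arrives during decoding. The helper below is a hedged sketch for intuition only (a hypothetical function, not the PR's actual `MRotaryEmbedding` code).

```python
import torch


def next_mrope_positions(context_len: int, seq_len: int,
                         mrope_position_delta: int) -> torch.Tensor:
    """Hypothetical helper: build (3, num_new_tokens) mrope positions for decoding.

    During decoding, the temporal, height, and width rows advance together,
    offset by the delta accumulated while processing the multimodal prompt.
    """
    positions = torch.arange(context_len, seq_len) + mrope_position_delta
    return positions.unsqueeze(0).expand(3, -1)
```

In the PR itself, this role is played by `SequenceData.mrope_position_delta` together with the model runner changes quoted earlier in this thread.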
OpenAI-compatible server

`vllm.entrypoints.openai.api_server` uses a model-independent multimodal data fetcher (e.g. `vllm.multimodal.utils.async_get_and_parse_image`), so the vision smart-resizing logic in `qwen-vl-utils` cannot be applied for now. I think it's good to create another PR to fix it later.

Multiple modalities support details
Since Qwen2-VL supports two modalities (images and videos), we need to handle some special cases, as below:
- I removed the same-key check in the `vllm.multimodal.base.MultiModalInputs.batch()` method, since different samples may return different modality keys (see the sketch below).
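For intuition, removing that check amounts to grouping tensors by whatever modality keys each sample actually provides, instead of asserting that all samples share the same keys. The function below is a hedged sketch of this idea, not the actual `MultiModalInputs.batch()` implementation.

```python
from collections import defaultdict
from typing import Dict, List

import torch


def batch_multimodal_inputs(
    inputs_list: List[Dict[str, torch.Tensor]],
) -> Dict[str, List[torch.Tensor]]:
    # Group tensors per key so that, e.g., an image-only sample and a
    # video-only sample can be batched together without a shared-keys assert.
    batched: Dict[str, List[torch.Tensor]] = defaultdict(list)
    for inputs in inputs_list:
        for key, tensor in inputs.items():
            batched[key].append(tensor)
    return dict(batched)
```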