Commit c0dbe09

JJJYmmm and Cyrilvallez authored
Adding Support for Qwen3-VL Series (#40795)
* add qwen3vl series
* make fixup
* fix import
* re-protect import
* fix it finally (need to merge main into the branch)
* skip processor test (need the checkpoint)
* oups typo
* simplify modular
* remove unnecessary attr
* fix layer
* remove unused rope_deltas args
* reuse image def
* remove unnecessary imports

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
1 parent fc5f910 commit c0dbe09

27 files changed, with 8039 additions and 7 deletions

docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -1127,6 +1127,10 @@
   title: Qwen2Audio
 - local: model_doc/qwen2_vl
   title: Qwen2VL
+- local: model_doc/qwen3_vl
+  title: Qwen3VL
+- local: model_doc/qwen3_vl_moe
+  title: Qwen3VLMoe
 - local: model_doc/sam2
   title: SAM2
 - local: model_doc/sam2_video
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on None and added to Hugging Face Transformers on 2025-08-16.*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> </div>
</div>

# Qwen3-VL

[Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions. Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment, which evolves from T-RoPE to text timestamp alignment for more precise temporal grounding. These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.

Model usage

<hfoptions id="usage">
<hfoption id="AutoModel">

```py
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")
messages = [
    {
        "role":"user",
        "content":[
            {
                "type":"image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type":"text",
                "text":"Describe this image."
            }
        ]
    }

]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</hfoption>
</hfoptions>

## Qwen3VLConfig

[[autodoc]] Qwen3VLConfig

## Qwen3VLTextConfig

[[autodoc]] Qwen3VLTextConfig

## Qwen3VLProcessor

[[autodoc]] Qwen3VLProcessor

## Qwen3VLVideoProcessor

[[autodoc]] Qwen3VLVideoProcessor

## Qwen3VLVisionModel

[[autodoc]] Qwen3VLVisionModel
    - forward

## Qwen3VLTextModel

[[autodoc]] Qwen3VLTextModel
    - forward

## Qwen3VLModel

[[autodoc]] Qwen3VLModel
    - forward

## Qwen3VLForConditionalGeneration

[[autodoc]] Qwen3VLForConditionalGeneration
    - forward
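
The usage example in this file slices the prompt tokens off each generated sequence before decoding, since the generated sequences begin with the prompt. A minimal sketch of that slicing step, with plain Python lists standing in for tensors; the helper name `trim_prompt` and the token values are made up for illustration and are not part of the commit:

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Mirrors the list comprehension in the usage example: for every
    (prompt, output) pair, keep only the tokens after the prompt length.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs (assumed values); real ones come from the processor and model.generate.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

new_tokens = trim_prompt(input_ids, generated_ids)
print(new_tokens)
```

Only the trimmed sequences are then passed to `batch_decode`, so the decoded text contains the model's answer without the echoed prompt.
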
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on None and added to Hugging Face Transformers on 2025-08-17.*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> </div>
</div>

# Qwen3-VL-Moe

[Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions. Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment, which evolves from T-RoPE to text timestamp alignment for more precise temporal grounding. These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.

Model usage

<hfoptions id="usage">
<hfoption id="AutoModel">

```py
import torch
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor

model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Moe",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Moe")
messages = [
    {
        "role":"user",
        "content":[
            {
                "type":"image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type":"text",
                "text":"Describe this image."
            }
        ]
    }

]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</hfoption>
</hfoptions>

## Qwen3VLMoeConfig

[[autodoc]] Qwen3VLMoeConfig

## Qwen3VLMoeTextConfig

[[autodoc]] Qwen3VLMoeTextConfig

## Qwen3VLMoeVisionModel

[[autodoc]] Qwen3VLMoeVisionModel
    - forward

## Qwen3VLMoeTextModel

[[autodoc]] Qwen3VLMoeTextModel
    - forward

## Qwen3VLMoeModel

[[autodoc]] Qwen3VLMoeModel
    - forward

## Qwen3VLMoeForConditionalGeneration

[[autodoc]] Qwen3VLMoeForConditionalGeneration
    - forward

src/transformers/models/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -278,6 +278,8 @@
 from .qwen3 import *
 from .qwen3_moe import *
 from .qwen3_next import *
+from .qwen3_vl import *
+from .qwen3_vl_moe import *
 from .rag import *
 from .recurrent_gemma import *
 from .reformer import *

src/transformers/models/auto/configuration_auto.py

Lines changed: 10 additions & 0 deletions
@@ -325,6 +325,10 @@
 ("qwen3", "Qwen3Config"),
 ("qwen3_moe", "Qwen3MoeConfig"),
 ("qwen3_next", "Qwen3NextConfig"),
+("qwen3_vl", "Qwen3VLConfig"),
+("qwen3_vl_moe", "Qwen3VLMoeConfig"),
+("qwen3_vl_moe_text", "Qwen3VLMoeTextConfig"),
+("qwen3_vl_text", "Qwen3VLTextConfig"),
 ("rag", "RagConfig"),
 ("realm", "RealmConfig"),
 ("recurrent_gemma", "RecurrentGemmaConfig"),
@@ -764,6 +768,10 @@
 ("qwen3", "Qwen3"),
 ("qwen3_moe", "Qwen3MoE"),
 ("qwen3_next", "Qwen3Next"),
+("qwen3_vl", "Qwen3VL"),
+("qwen3_vl_moe", "Qwen3VLMoe"),
+("qwen3_vl_moe_text", "Qwen3VLMoe"),
+("qwen3_vl_text", "Qwen3VL"),
 ("rag", "RAG"),
 ("realm", "REALM"),
 ("recurrent_gemma", "RecurrentGemma"),
@@ -952,6 +960,8 @@
 ("internvl_vision", "internvl"),
 ("qwen2_5_vl_text", "qwen2_5_vl"),
 ("qwen2_vl_text", "qwen2_vl"),
+("qwen3_vl_text", "qwen3_vl"),
+("qwen3_vl_moe_text", "qwen3_vl_moe"),
 ("sam_vision_model", "sam"),
 ("sam2_vision_model", "sam2"),
 ("sam2_hiera_det_model", "sam2"),
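
The last hunk in this file maps the text sub-config types to their parent model names, presumably because the text configs live in the parent model's package rather than in a package of their own. A rough sketch of that kind of lookup, using only entries visible in this diff; the names `module_for` and `import_target`, the dict names, and the dash-to-underscore fallback are assumptions for illustration, not a copy of the library code:

```python
# Entries copied from the hunks above; the surrounding structure is illustrative.
SPECIAL_MODULE_NAMES = {
    "qwen3_vl_text": "qwen3_vl",
    "qwen3_vl_moe_text": "qwen3_vl_moe",
}

CONFIG_CLASS_NAMES = {
    "qwen3_vl": "Qwen3VLConfig",
    "qwen3_vl_moe": "Qwen3VLMoeConfig",
    "qwen3_vl_moe_text": "Qwen3VLMoeTextConfig",
    "qwen3_vl_text": "Qwen3VLTextConfig",
}


def module_for(model_type: str) -> str:
    """Return the package name that is assumed to hold the given model type."""
    if model_type in SPECIAL_MODULE_NAMES:
        return SPECIAL_MODULE_NAMES[model_type]
    # Assumed fallback: the model type doubles as the package name.
    return model_type.replace("-", "_")


def import_target(model_type: str) -> tuple:
    """Return (module path, config class name) for a registered model type."""
    return f"transformers.models.{module_for(model_type)}", CONFIG_CLASS_NAMES[model_type]


print(import_target("qwen3_vl_moe_text"))
```

Under these assumptions, a sub-config such as `qwen3_vl_moe_text` resolves to a class inside the `qwen3_vl_moe` package, while `qwen3_vl` resolves to its own package.
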

src/transformers/models/auto/image_processing_auto.py

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@
 ("pvt_v2", ("PvtImageProcessor", "PvtImageProcessorFast")),
 ("qwen2_5_vl", ("Qwen2VLImageProcessor", "Qwen2VLImageProcessorFast")),
 ("qwen2_vl", ("Qwen2VLImageProcessor", "Qwen2VLImageProcessorFast")),
+("qwen3_vl", ("Qwen2VLImageProcessor", "Qwen2VLImageProcessorFast")),
 ("regnet", ("ConvNextImageProcessor", "ConvNextImageProcessorFast")),
 ("resnet", ("ConvNextImageProcessor", "ConvNextImageProcessorFast")),
 ("rt_detr", ("RTDetrImageProcessor", "RTDetrImageProcessorFast")),

src/transformers/models/auto/modeling_auto.py

Lines changed: 8 additions & 0 deletions
@@ -319,6 +319,10 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
 ("qwen3", "Qwen3Model"),
 ("qwen3_moe", "Qwen3MoeModel"),
 ("qwen3_next", "Qwen3NextModel"),
+("qwen3_vl", "Qwen3VLModel"),
+("qwen3_vl_moe", "Qwen3VLMoeModel"),
+("qwen3_vl_moe_text", "Qwen3VLMoeTextModel"),
+("qwen3_vl_text", "Qwen3VLTextModel"),
 ("recurrent_gemma", "RecurrentGemmaModel"),
 ("reformer", "ReformerModel"),
 ("regnet", "RegNetModel"),
@@ -974,6 +978,8 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
 ("pix2struct", "Pix2StructForConditionalGeneration"),
 ("qwen2_5_vl", "Qwen2_5_VLForConditionalGeneration"),
 ("qwen2_vl", "Qwen2VLForConditionalGeneration"),
+("qwen3_vl", "Qwen3VLForConditionalGeneration"),
+("qwen3_vl_moe", "Qwen3VLMoeForConditionalGeneration"),
 ("video_llava", "VideoLlavaForConditionalGeneration"),
 ("vipllava", "VipLlavaForConditionalGeneration"),
 ("vision-encoder-decoder", "VisionEncoderDecoderModel"),
@@ -1028,6 +1034,8 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
 ("pixtral", "LlavaForConditionalGeneration"),
 ("qwen2_5_vl", "Qwen2_5_VLForConditionalGeneration"),
 ("qwen2_vl", "Qwen2VLForConditionalGeneration"),
+("qwen3_vl", "Qwen3VLForConditionalGeneration"),
+("qwen3_vl_moe", "Qwen3VLMoeForConditionalGeneration"),
 ("shieldgemma2", "Gemma3ForConditionalGeneration"),
 ("smolvlm", "SmolVLMForConditionalGeneration"),
 ("udop", "UdopForConditionalGeneration"),

src/transformers/models/auto/processing_auto.py

Lines changed: 2 additions & 0 deletions
@@ -120,6 +120,8 @@
 ("qwen2_5_vl", "Qwen2_5_VLProcessor"),
 ("qwen2_audio", "Qwen2AudioProcessor"),
 ("qwen2_vl", "Qwen2VLProcessor"),
+("qwen3_vl", "Qwen3VLProcessor"),
+("qwen3_vl_moe", "Qwen3VLProcessor"),
 ("sam", "SamProcessor"),
 ("sam2", "Sam2Processor"),
 ("sam_hq", "SamHQProcessor"),

src/transformers/models/auto/tokenization_auto.py

Lines changed: 2 additions & 0 deletions
@@ -583,6 +583,8 @@
         "Qwen2TokenizerFast" if is_tokenizers_available() else None,
     ),
 ),
+("qwen3_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
+("qwen3_vl_moe", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
 ("rag", ("RagTokenizer", None)),
 ("realm", ("RealmTokenizer", "RealmTokenizerFast" if is_tokenizers_available() else None)),
 (
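
Each tokenizer entry above is a `(slow, fast)` pair of class names, with the fast slot set to `None` when the `tokenizers` package is unavailable. A simplified sketch of choosing between the two follows; `make_entry`, `pick_tokenizer_name`, and the `tokenizers_available` flag are hypothetical stand-ins, and the actual resolution logic in the library covers more cases than this:

```python
from typing import Optional, Tuple


def make_entry(tokenizers_available: bool) -> Tuple[str, Optional[str]]:
    """Build a (slow, fast) pair shaped like the qwen3_vl entries in the diff."""
    return ("Qwen2Tokenizer", "Qwen2TokenizerFast" if tokenizers_available else None)


def pick_tokenizer_name(entry: Tuple[str, Optional[str]], use_fast: bool = True) -> str:
    """Prefer the fast class when requested and present, else fall back to slow."""
    slow, fast = entry
    if use_fast and fast is not None:
        return fast
    return slow


print(pick_tokenizer_name(make_entry(tokenizers_available=True)))
print(pick_tokenizer_name(make_entry(tokenizers_available=False)))
```

Both `qwen3_vl` and `qwen3_vl_moe` are registered with the same pair, so under this sketch they resolve to the same tokenizer class.
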

src/transformers/models/auto/video_processing_auto.py

Lines changed: 2 additions & 0 deletions
@@ -56,6 +56,8 @@
 ("qwen2_5_omni", "Qwen2VLVideoProcessor"),
 ("qwen2_5_vl", "Qwen2VLVideoProcessor"),
 ("qwen2_vl", "Qwen2VLVideoProcessor"),
+("qwen3_vl", "Qwen3VLVideoProcessor"),
+("qwen3_vl_moe", "Qwen3VLVideoProcessor"),
 ("sam2_video", "Sam2VideoVideoProcessor"),
 ("smolvlm", "SmolVLMVideoProcessor"),
 ("video_llava", "VideoLlavaVideoProcessor"),
