[Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions. Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding. These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.
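The interleaved MRoPE layout mentioned above factorizes rotary positions across temporal, height, and width axes, so every visual token carries a distinct `(t, h, w)` index triple. The following framework-free sketch only illustrates that indexing idea; the function name and interleaving details are placeholders, not the actual Qwen3-VL implementation:

```python
def mrope_position_ids(num_frames, height, width):
    """Assign a (t, h, w) position triple to every token of a visual grid.

    In MRoPE-style schemes, text tokens share one index across all three
    axes, while visual tokens get distinct per-axis indices. This sketch
    covers only the visual grid.
    """
    positions = []
    for t in range(num_frames):
        for h in range(height):
            for w in range(width):
                positions.append((t, h, w))
    return positions

# a 2-frame video patchified into a 2x2 grid yields 8 visual tokens
grid = mrope_position_ids(num_frames=2, height=2, width=2)
```

Each axis then receives its own slice of the rotary embedding dimensions, which is what lets the model distinguish "later in time" from "further right in the frame".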
## Model usage

<hfoptions id="usage">
<hfoption id="AutoModel">

```py
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# checkpoint name and loading options are placeholders; the original snippet was truncated here
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct",
    dtype=torch.bfloat16,
    device_map="auto",
)
```

</hfoption>
</hfoptions>
The MoE variant follows the same pattern, swapping in `Qwen3VLMoeForConditionalGeneration`:

```py
import torch
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor

# checkpoint name and loading options are placeholders; the original snippet was truncated here
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    device_map="auto",
)
```
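In an MoE layer, each token is routed to a small subset of expert feed-forward networks rather than through one dense block. As a minimal, self-contained sketch of top-k softmax gating (the expert count and `k` below are illustrative, not Qwen3-VL's actual configuration):

```python
import math

def top_k_gate(logits, k=2):
    """Pick the k highest-scoring experts for one token and
    renormalize their softmax weights over just that subset."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# route one token among 4 hypothetical experts
weights = top_k_gate([0.1, 2.0, -1.0, 2.0], k=2)
```

The token's output is then the weighted sum of the selected experts' outputs, which keeps per-token compute close to a dense model of much smaller size.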