
Conversation

@zju-stu-lizheng
Contributor

This PR introduces support for the upcoming Qwen3-VL models — including both dense and MoE variants, as well as Instruct and Thinking editions. As the next generation of the Qwen-VL family, Qwen3-VL delivers significant advancements in visual understanding while maintaining robust pure-text performance, achieving state-of-the-art results across complex multimodal benchmarks.

Core implementation details can also be found in the corresponding PR in the Transformers repo:
🔗 huggingface/transformers#40795

Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn>
Co-authored-by: cao1zhg <653506626@qq.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @zju-stu-lizheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest Qwen3-VL multimodal models, encompassing both standard and Mixture-of-Experts configurations. The changes enable the system to process and leverage advanced visual understanding capabilities, enhancing its performance on complex multimodal benchmarks. The core objective is to broaden the range of supported state-of-the-art multimodal large language models.

Highlights

  • New Model Support: Introduces comprehensive support for the upcoming Qwen3-VL series, including both its dense and Mixture-of-Experts (MoE) variants, as well as Instruct and Thinking editions. This expands the multimodal capabilities of the system.
  • Deepstack Embedding Integration: Adds support for 'deepstack' embeddings within the multimodal utility functions, allowing for more sophisticated integration of visual features into the language model's hidden states. This involves modifying the embed_mm_inputs and general_mm_embed_routine functions to handle and pass deepstack-specific information (a sketch of the merge step follows this list).
  • Vision Encoder Components: New Python modules are added to define the Qwen3-VL's vision encoder architecture, including Qwen3_VisionPatchEmbed, Qwen3_VisionMLP, Qwen3_VisionBlock, Qwen3_VisionPatchMerger, and the overarching Qwen3_VisionTransformer.
  • Rotary Embedding Updates: The rotary embedding logic has been updated to explicitly include support for qwen3_vl and qwen3_vl_moe model types, ensuring correct positional encoding for these new models.
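
Of the points above, the deepstack integration is the least self-explanatory: roughly, extra features taken from several intermediate vision-encoder layers are added onto the language model's hidden states at image-token positions in the first few decoder layers, rather than feeding all visual information in at the embedding layer only. Below is a minimal sketch of that merge step; the function name, tensor shapes, and mask handling are chosen here for illustration and are not the PR's actual embed_mm_inputs / general_mm_embed_routine code.

import torch

def add_deepstack_embeds(
    hidden_states: torch.Tensor,      # (num_tokens, hidden_size) decoder activations
    deepstack_embeds: torch.Tensor,   # (num_levels, num_image_tokens, hidden_size) extra ViT features
    image_token_mask: torch.Tensor,   # (num_tokens,) bool mask of image-token positions
    layer_idx: int,
) -> torch.Tensor:
    """Merge one level of deepstack visual features into the decoder hidden states."""
    if layer_idx >= deepstack_embeds.size(0):
        # Only the first num_levels decoder layers receive extra visual features.
        return hidden_states
    hidden_states = hidden_states.clone()
    # Add (not replace) the level-specific visual features at image-token positions.
    hidden_states[image_token_mask] = (
        hidden_states[image_token_mask] + deepstack_embeds[layer_idx].to(hidden_states.dtype)
    )
    return hidden_states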

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Qwen3-VL series of models. The changes primarily involve adding new model definitions for both dense and MoE variants and updating the surrounding infrastructure to handle them, including support for deepstack embeddings. The implementation appears to be a solid extension of the existing Qwen-VL support. I've identified a few areas for improvement, including a critical bug in an assert statement, a potential AttributeError in the MoE model, use of a magic number, and several maintainability issues like code duplication and incorrect type hints. Addressing these points will enhance the robustness and clarity of the new model support.

@ocss884
Contributor

ocss884 commented Sep 11, 2025

@mickqian mickqian changed the title from "Adding Support for Qwen3-VL Series" to "model: support qwen3-vl series" on Sep 12, 2025
@Alexhaoge
Contributor

Does the current implementation support expert parallelism for Qwen3-VL MoE models? I tried to launch the server with --tp-size 2 --ep-size 2 and got the following error, which looks like a weight-loading issue.

[2025-09-19 11:27:06 TP0 EP0] Init torch distributed ends. mem usage=0.90 GB
[2025-09-19 11:27:07 TP0 EP0] Load weight begin. avail mem=93.78 GB
[2025-09-19 11:27:08 TP0 EP0] Using fa3 as multimodal attention backend.
[2025-09-19 11:27:08 TP1 EP1] Using fa3 as multimodal attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
[2025-09-19 11:27:14 TP0 EP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/sglang/python/sglang/srt/managers/scheduler.py", line 2790, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/sglang/python/sglang/srt/managers/scheduler.py", line 352, in init
self.tp_worker = TpWorkerClass(
^^^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 73, in init
self.worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/managers/tp_worker.py", line 96, in init
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/model_executor/model_runner.py", line 261, in init
self.initialize(min_per_gpu_memory)
File "/home/sglang/python/sglang/srt/model_executor/model_runner.py", line 308, in initialize
self.load_model()
File "/home/sglang/python/sglang/srt/model_executor/model_runner.py", line 751, in load_model
self.model = get_model(
^^^^^^^^^^
File "/home/sglang/python/sglang/srt/model_loader/init.py", line 28, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/model_loader/loader.py", line 491, in load_model
self.load_weights_and_postprocess(
File "/home/sglang/python/sglang/srt/model_loader/loader.py", line 499, in load_weights_and_postprocess
model.load_weights(weights)
File "/home/sglang/python/sglang/srt/models/qwen3_vl_moe.py", line 355, in load_weights
self.load_fused_expert_weights(
File "/home/sglang/python/sglang/srt/models/qwen3_vl_moe.py", line 250, in load_fused_expert_weights
weight_loader(param,
File "/home/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 499, in weight_loader
self._weight_loader_impl(
File "/home/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 587, in _weight_loader_impl
expert_data = param.data[expert_id]
~~~~~~~~~~^^^^^^^^^^^
IndexError: index 64 is out of bounds for dimension 0 with size 64
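
The IndexError above is consistent with a missing global-to-local expert-id mapping on the fused-expert loading path: under expert parallelism each rank only allocates num_experts / ep_size expert slots (64 here), so indexing the per-rank parameter with a global expert id of 64 or above falls off the end. Below is a sketch of the mapping such a loader typically needs; the helper name and the contiguous-partition assumption are illustrative, not sglang's actual implementation.

def map_global_to_local_expert(global_expert_id: int, ep_rank: int,
                               num_experts: int, ep_size: int):
    """Return this rank's local slot for a checkpoint (global) expert id,
    or None if the expert belongs to another EP rank (contiguous partition)."""
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    if not (start <= global_expert_id < start + experts_per_rank):
        return None  # not owned by this rank: skip loading this expert's weights
    return global_expert_id - start

# Example: with 128 experts and ep_size=2, rank 0 owns experts 0-63 and
# rank 1 owns experts 64-127; global id 64 maps to local slot 0 on rank 1.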

@yhyang201
Collaborator

Does the current implementation support expert parallelism for Qwen3-VL MoE models? I tried to launch the server with --tp-size 2 --ep-size 2 and got the following error, which looks like a weight-loading issue. [duplicate traceback omitted; see the comment above]

Which model weights are you using?

@Alexhaoge
Contributor

Which model weights are you using?

Qwen3-VL-30B-A3B-Instruct with random weights generated using transformers. I initialized the model config with transformers.models.qwen3_vl_moe.Qwen3VLMoeConfig and then aligned the LLM part's config with Qwen3-30B-A3B. The weights load fine with tp=2 but fail with tp=2 ep=2.

@casper-hansen

casper-hansen commented Sep 21, 2025

This PR for Qwen3-VL lacks LoRA compatibility (same as Qwen2.5-VL).

The following helps the LoRA manager skip unsupported modules. (reference issue: #6608)

  • Dense (only supports qkvo and mlp LoRA on the language model; a standalone check of this pattern follows the list):
    lora_pattern = re.compile(
        r"^language_model\.layers\.(\d+)\.(?:self_attn|mlp)\.(?:qkv_proj|o_proj|down_proj|gate_up_proj)"
    )

    def should_apply_lora(self, module_name: str) -> bool:
        return bool(self.lora_pattern.match(module_name))
  • MoE (only supports qkvo LoRA on the language model):
    lora_pattern = re.compile(
        r"^language_model\.layers\.(\d+)\.(?:self_attn)\.(?:qkv_proj|o_proj)"
    )

    def should_apply_lora(self, module_name: str) -> bool:
        return bool(self.lora_pattern.match(module_name))
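
For reference, the dense-variant pattern above can be checked standalone; the vision module name in the example below is illustrative, not taken from the model's actual module tree.

import re

lora_pattern = re.compile(
    r"^language_model\.layers\.(\d+)\.(?:self_attn|mlp)\.(?:qkv_proj|o_proj|down_proj|gate_up_proj)"
)

for name in [
    "language_model.layers.3.self_attn.qkv_proj",  # matches -> LoRA applied
    "language_model.layers.3.mlp.gate_up_proj",    # matches -> LoRA applied
    "visual.blocks.0.attn.qkv_proj",               # no match -> skipped
]:
    print(name, bool(lora_pattern.match(name)))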

Without the above code, the vision LoRA modules are not skipped, which causes an error in the following loop:

for module_name, module in self.base_model.named_modules():
    # TODO (lifuhuang): in the future, we should consider generalizing the
    # should_apply_lora function to support mapping by full module name instead
    # of just the last part (e.g., "qkv_proj") to support scenarios with multiple
    # attention stacks (e.g., multimodal models).
    # See: https://github.com/sgl-project/sglang/issues/6608
    if getattr(
        self.base_model, "should_apply_lora", None
    ) and not self.base_model.should_apply_lora(module_name):
        continue
    # Skip vision model
    if self.should_skip_lora_for_vision_model(module_name):
        continue
    # The module should be converted if it is included in target_names
    if module_name.split(".")[-1] in self.target_modules:
        layer_id = get_layer_id(module_name)
        self.lora_modules[layer_id][module_name] = self.set_lora_module(
            module_name, module
        )

This happens because should_apply_lora is not defined, which causes the following failure when loading the adapter:

  File "/root/.venv/lib/python3.11/site-packages/sglang/srt/lora/lora_manager.py", line 439, in init_lora_modules
    self.lora_modules[layer_id][module_name] = self.set_lora_module(
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^
TypeError: list indices must be integers or slices, not NoneType
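
A plausible reading of the traceback: a multimodal module name slips past the vision skip, the layer-id helper finds no "layers.<n>" component to parse, returns None, and indexing self.lora_modules[None] raises the TypeError above. Below is a simplified stand-in for that helper; the regex and the module names are illustrative, not the exact sglang code.

import re

def get_layer_id_sketch(module_name: str):
    """Simplified stand-in: pull the decoder layer index out of names like
    'language_model.layers.7.self_attn.qkv_proj'; return None when absent."""
    match = re.search(r"layers\.(\d+)", module_name)
    return int(match.group(1)) if match else None

print(get_layer_id_sketch("language_model.layers.7.self_attn.qkv_proj"))  # 7
print(get_layer_id_sketch("visual.merger.mlp.0"))  # None -> lora_modules[None] fails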

@mickqian
Collaborator

@casper-hansen thanks Casper, do we already have any LoRAs for this model? If not, we can merge this and move LoRA support to another PR.

@yhyang201 yhyang201 requested a review from mickqian September 22, 2025 08:00
@yhyang201 yhyang201 added and then removed the ready-to-merge label (The PR is ready to merge after the CI is green.) Sep 22, 2025
@merrymercy merrymercy merged commit 4f564b9 into sgl-project:main Sep 23, 2025
138 of 167 checks passed
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: cao1zhg <653506626@qq.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>