
Conversation

@zju-stu-lizheng
Contributor

This PR introduces support for the upcoming Qwen3-VL models — including both dense and MoE variants, as well as Instruct and Thinking editions. As the next generation of the Qwen-VL family, Qwen3-VL delivers significant advancements in visual understanding while maintaining robust pure-text performance, achieving state-of-the-art results across complex multimodal benchmarks.

Core implementation details can also be found in the corresponding PR in the Transformers repo:
🔗 huggingface/transformers#40795

Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn>
Co-authored-by: cao1zhg <653506626@qq.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @zju-stu-lizheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest Qwen3-VL multimodal models, encompassing both standard and Mixture-of-Experts configurations. The changes enable the system to process and leverage advanced visual understanding capabilities, enhancing its performance on complex multimodal benchmarks. The core objective is to broaden the range of supported state-of-the-art multimodal large language models.

Highlights

  • New Model Support: Introduces comprehensive support for the upcoming Qwen3-VL series, including both its dense and Mixture-of-Experts (MoE) variants, as well as Instruct and Thinking editions. This expands the multimodal capabilities of the system.
  • Deepstack Embedding Integration: Adds support for 'deepstack' embeddings within the multimodal utility functions, allowing for more sophisticated integration of visual features into the language model's hidden states. This involves modifying the embed_mm_inputs and general_mm_embed_routine functions to handle and pass deepstack-specific information (a sketch of the merge step follows this list).
  • Vision Encoder Components: New Python modules are added to define the Qwen3-VL's vision encoder architecture, including Qwen3_VisionPatchEmbed, Qwen3_VisionMLP, Qwen3_VisionBlock, Qwen3_VisionPatchMerger, and the overarching Qwen3_VisionTransformer.
  • Rotary Embedding Updates: The rotary embedding logic has been updated to explicitly include support for qwen3_vl and qwen3_vl_moe model types, ensuring correct positional encoding for these new models.
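
Of the points above, the deepstack integration is the least self-explanatory: roughly, extra features taken from several intermediate vision-encoder layers are added onto the language model's hidden states at image-token positions in the first few decoder layers, rather than feeding all visual information in at the embedding layer only. Below is a minimal sketch of that merge step; the function name, tensor shapes, and mask handling are chosen here for illustration and are not the PR's actual embed_mm_inputs / general_mm_embed_routine code.

import torch

def add_deepstack_embeds(
    hidden_states: torch.Tensor,      # (num_tokens, hidden_size) decoder activations
    deepstack_embeds: torch.Tensor,   # (num_levels, num_image_tokens, hidden_size) extra ViT features
    image_token_mask: torch.Tensor,   # (num_tokens,) bool mask of image-token positions
    layer_idx: int,
) -> torch.Tensor:
    """Merge one level of deepstack visual features into the decoder hidden states."""
    if layer_idx >= deepstack_embeds.size(0):
        # Only the first num_levels decoder layers receive extra visual features.
        return hidden_states
    hidden_states = hidden_states.clone()
    # Add (not replace) the level-specific visual features at image-token positions.
    hidden_states[image_token_mask] = (
        hidden_states[image_token_mask] + deepstack_embeds[layer_idx].to(hidden_states.dtype)
    )
    return hidden_states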

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Qwen3-VL series of models. The changes primarily involve adding new model definitions for both dense and MoE variants and updating the surrounding infrastructure to handle them, including support for deepstack embeddings. The implementation appears to be a solid extension of the existing Qwen-VL support. I've identified a few areas for improvement, including a critical bug in an assert statement, a potential AttributeError in the MoE model, use of a magic number, and several maintainability issues like code duplication and incorrect type hints. Addressing these points will enhance the robustness and clarity of the new model support.

@ocss884
Contributor

ocss884 commented Sep 11, 2025

@mickqian mickqian changed the title from "Adding Support for Qwen3-VL Series" to "model: support qwen3-vl series" on Sep 12, 2025
@Alexhaoge
Contributor

Does the current implementation support expert parallelism for Qwen3-VL MoE models? I tried to launch the server with --tp-size 2 --ep-size 2 and got the following error, which looks like a weight-loading issue.

[2025-09-19 11:27:06 TP0 EP0] Init torch distributed ends. mem usage=0.90 GB
[2025-09-19 11:27:07 TP0 EP0] Load weight begin. avail mem=93.78 GB
[2025-09-19 11:27:08 TP0 EP0] Using fa3 as multimodal attention backend.
[2025-09-19 11:27:08 TP1 EP1] Using fa3 as multimodal attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
[2025-09-19 11:27:14 TP0 EP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/sglang/python/sglang/srt/managers/scheduler.py", line 2790, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/sglang/python/sglang/srt/managers/scheduler.py", line 352, in init
self.tp_worker = TpWorkerClass(
^^^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 73, in init
self.worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/managers/tp_worker.py", line 96, in init
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/model_executor/model_runner.py", line 261, in init
self.initialize(min_per_gpu_memory)
File "/home/sglang/python/sglang/srt/model_executor/model_runner.py", line 308, in initialize
self.load_model()
File "/home/sglang/python/sglang/srt/model_executor/model_runner.py", line 751, in load_model
self.model = get_model(
^^^^^^^^^^
File "/home/sglang/python/sglang/srt/model_loader/init.py", line 28, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "/home/sglang/python/sglang/srt/model_loader/loader.py", line 491, in load_model
self.load_weights_and_postprocess(
File "/home/sglang/python/sglang/srt/model_loader/loader.py", line 499, in load_weights_and_postprocess
model.load_weights(weights)
File "/home/sglang/python/sglang/srt/models/qwen3_vl_moe.py", line 355, in load_weights
self.load_fused_expert_weights(
File "/home/sglang/python/sglang/srt/models/qwen3_vl_moe.py", line 250, in load_fused_expert_weights
weight_loader(param,
File "/home/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 499, in weight_loader
self._weight_loader_impl(
File "/home/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 587, in _weight_loader_impl
expert_data = param.data[expert_id]
~~~~~~~~~~^^^^^^^^^^^
IndexError: index 64 is out of bounds for dimension 0 with size 64
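
The IndexError above is consistent with a missing global-to-local expert-id mapping on the fused-expert loading path: under expert parallelism each rank only allocates num_experts / ep_size expert slots (64 here), so indexing the per-rank parameter with a global expert id of 64 or above falls off the end. Below is a sketch of the mapping such a loader typically needs; the helper name and the contiguous-partition assumption are illustrative, not sglang's actual implementation.

def map_global_to_local_expert(global_expert_id: int, ep_rank: int,
                               num_experts: int, ep_size: int):
    """Return this rank's local slot for a checkpoint (global) expert id,
    or None if the expert belongs to another EP rank (contiguous partition)."""
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    if not (start <= global_expert_id < start + experts_per_rank):
        return None  # not owned by this rank: skip loading this expert's weights
    return global_expert_id - start

# Example: with 128 experts and ep_size=2, rank 0 owns experts 0-63 and
# rank 1 owns experts 64-127; global id 64 maps to local slot 0 on rank 1.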

@yhyang201
Collaborator

Does the current implementation support expert parallelism for Qwen3-VL MoE models? I tried to launch the server with --tp-size 2 --ep-size 2 and got the following error, which looks like a weight-loading issue. [duplicate traceback omitted; see the comment above]

Which model weights are you using?

@Alexhaoge
Contributor

Which model weights are you using?

Qwen3-VL-30B-A3B-Instruct with random weights generated using transformers. I initialized the model config with transformers.models.qwen3_vl_moe.Qwen3VLMoeConfig and then aligned the LLM part's config with Qwen3-30B-A3B. The weights load fine with tp=2 but fail with tp=2 ep=2.

@casper-hansen

casper-hansen commented Sep 21, 2025

This PR for Qwen3-VL lacks LoRA compatibility (same as Qwen2.5-VL).

The following helps the LoRA manager skip unsupported modules. (reference issue: #6608)

  • Dense (only supports qkvo and mlp LoRA on the language model; a standalone check of this pattern follows the list):
    lora_pattern = re.compile(
        r"^language_model\.layers\.(\d+)\.(?:self_attn|mlp)\.(?:qkv_proj|o_proj|down_proj|gate_up_proj)"
    )

    def should_apply_lora(self, module_name: str) -> bool:
        return bool(self.lora_pattern.match(module_name))
  • MoE (only supports qkvo LoRA on the language model):
    lora_pattern = re.compile(
        r"^language_model\.layers\.(\d+)\.(?:self_attn)\.(?:qkv_proj|o_proj)"
    )

    def should_apply_lora(self, module_name: str) -> bool:
        return bool(self.lora_pattern.match(module_name))
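
For reference, the dense-variant pattern above can be checked standalone; the vision module name in the example below is illustrative, not taken from the model's actual module tree.

import re

lora_pattern = re.compile(
    r"^language_model\.layers\.(\d+)\.(?:self_attn|mlp)\.(?:qkv_proj|o_proj|down_proj|gate_up_proj)"
)

for name in [
    "language_model.layers.3.self_attn.qkv_proj",  # matches -> LoRA applied
    "language_model.layers.3.mlp.gate_up_proj",    # matches -> LoRA applied
    "visual.blocks.0.attn.qkv_proj",               # no match -> skipped
]:
    print(name, bool(lora_pattern.match(name)))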

Without the above code, the vision LoRA modules are not skipped, which causes an error in the following loop:

for module_name, module in self.base_model.named_modules():
    # TODO (lifuhuang): in the future, we should consider generalizing the
    # should_apply_lora function to support mapping by full module name instead
    # of just the last part (e.g., "qkv_proj") to support scenarios with multiple
    # attention stacks (e.g., multimodal models).
    # See: https://github.com/sgl-project/sglang/issues/6608
    if getattr(
        self.base_model, "should_apply_lora", None
    ) and not self.base_model.should_apply_lora(module_name):
        continue
    # Skip vision model
    if self.should_skip_lora_for_vision_model(module_name):
        continue
    # The module should be converted if it is included in target_names
    if module_name.split(".")[-1] in self.target_modules:
        layer_id = get_layer_id(module_name)
        self.lora_modules[layer_id][module_name] = self.set_lora_module(
            module_name, module
        )

This happens because should_apply_lora is not defined, which causes the following failure when loading the adapter:

  File "/root/.venv/lib/python3.11/site-packages/sglang/srt/lora/lora_manager.py", line 439, in init_lora_modules
    self.lora_modules[layer_id][module_name] = self.set_lora_module(
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^
TypeError: list indices must be integers or slices, not NoneType
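
A plausible reading of the traceback: a multimodal module name slips past the vision skip, the layer-id helper finds no "layers.<n>" component to parse, returns None, and indexing self.lora_modules[None] raises the TypeError above. Below is a simplified stand-in for that helper; the regex and the module names are illustrative, not the exact sglang code.

import re

def get_layer_id_sketch(module_name: str):
    """Simplified stand-in: pull the decoder layer index out of names like
    'language_model.layers.7.self_attn.qkv_proj'; return None when absent."""
    match = re.search(r"layers\.(\d+)", module_name)
    return int(match.group(1)) if match else None

print(get_layer_id_sketch("language_model.layers.7.self_attn.qkv_proj"))  # 7
print(get_layer_id_sketch("visual.merger.mlp.0"))  # None -> lora_modules[None] fails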

@mickqian
Collaborator

@casper-hansen thanks Casper, do we already have any LoRAs for this model? If not, we can merge this and move LoRA support to another PR.

@yhyang201 yhyang201 requested a review from mickqian September 22, 2025 08:00
@yhyang201 yhyang201 added and then removed the ready-to-merge label (The PR is ready to merge after the CI is green.) Sep 22, 2025
@merrymercy merrymercy merged commit 4f564b9 into sgl-project:main Sep 23, 2025
138 of 167 checks passed
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: cao1zhg <653506626@qq.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>