Qwen3-VL #1063

@ridcl

Description

Is your feature request related to a problem? Please describe.

Currently, Tunix supports the text-only Qwen3, but not the multimodal Qwen3-VL. This makes it harder to compare the performance of different VLMs on vision-language tasks.

Describe the solution you'd like

According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.
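To make the merger piece concrete, here is a minimal numpy sketch of a vision-language merger feeding an LLM (all names, dims, and the ReLU are my placeholders; per the report the real merger is MLP-based, and it likely also merges spatially adjacent patch tokens):

```python
import numpy as np

def mlp_merger(vision_feats, w1, b1, w2, b2):
    """Project vision-encoder features into the LLM embedding space.

    Two-layer MLP sketch; activation and dims are placeholders.
    """
    h = np.maximum(vision_feats @ w1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ w2 + b2

# toy dims (hypothetical): 16 patch tokens, vision dim 32, LLM dim 64
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(16, 32))
w1, b1 = rng.normal(size=(32, 64)), np.zeros(64)
w2, b2 = rng.normal(size=(64, 64)), np.zeros(64)

text_embeds = rng.normal(size=(8, 64))  # 8 text tokens
vision_embeds = mlp_merger(vision_feats, w1, b1, w2, b2)

# vision tokens are projected, then concatenated with text tokens
llm_inputs = np.concatenate([vision_embeds, text_embeds], axis=0)
print(llm_inputs.shape)  # (24, 64)
```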

It's also worth mentioning:

  • Interleaved MRoPE, which we don't seem to support yet
  • DeepStack for the vision-language merger, which is more involved than the merger we have for Gemma 3
  • Video timestamp handling
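For the first bullet, my reading of interleaved MRoPE is that instead of splitting the rotary frequencies into contiguous temporal/height/width blocks (as in earlier MRoPE), the three axes are assigned frequencies round-robin, so each axis covers both low and high frequencies. A hedged sketch (function name and layout are my assumptions, not the report's exact formulation):

```python
import numpy as np

def interleaved_mrope_angles(t, h, w, head_dim, base=10000.0):
    """Rotary angles for one token at position (t, h, w).

    Sketch of the interleaved variant: frequencies are assigned to
    the t/h/w axes in a t, h, w, t, h, w, ... pattern, rather than
    as three contiguous frequency blocks.
    """
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
    pos = np.empty(half)
    pos[0::3] = t  # temporal axis takes every 3rd frequency
    pos[1::3] = h  # height
    pos[2::3] = w  # width
    return pos * inv_freq  # later turned into cos/sin rotations

angles = interleaved_mrope_angles(t=2, h=3, w=5, head_dim=24)
print(angles.shape)  # (12,)
```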

Additional context

A couple of design questions:

  1. Should we wait for #511 to be merged, or start the work on Qwen3-VL in parallel? Calling @abheesht17 for an opinion.
  2. Should we extend the (text-only) Qwen3 model or create a totally new one? I'm not sure it will be easy to integrate DeepStack without changing the text-only implementation.
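To illustrate why question 2 is tricky: DeepStack (as I understand it) injects features from several ViT levels into the residual stream of the first few decoder layers, at the vision-token positions, rather than feeding vision tokens only at the LLM input. That means the text-only decoder loop would need a hook per layer. A hypothetical sketch (names and shapes are mine, not Tunix's):

```python
import numpy as np

def deepstack_inject(hidden, vision_positions, level_feats, layer_idx):
    """Hypothetical DeepStack-style injection, not an existing API.

    Adds ViT level-k features to the hidden states of decoder layer k
    (for the first len(level_feats) layers), at vision-token slots.
    """
    if layer_idx < len(level_feats):
        hidden = hidden.copy()
        hidden[vision_positions] += level_feats[layer_idx]
    return hidden

# toy run: 10-token sequence, first 3 are vision tokens, hidden dim 8
rng = np.random.default_rng(0)
hidden = np.zeros((10, 8))
vision_positions = np.array([0, 1, 2])
level_feats = [rng.normal(size=(3, 8)) for _ in range(2)]  # 2 ViT levels

for layer_idx in range(4):  # pretend 4 decoder layers
    hidden = deepstack_inject(hidden, vision_positions, level_feats, layer_idx)
    # ... layer `layer_idx`'s attention + MLP would run here ...
```

Only the first two layers receive an injection here; the per-layer hook is exactly the part that doesn't exist in a text-only decoder, which is why extending Qwen3 in place may be awkward.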

Checklist

  • I have searched the existing issues for similar feature requests.
  • This is not a support question (please use the "bug template" for that).
