Is your feature request related to a problem? Please describe.
Currently, Tunix supports text-only Qwen3, but not multimodal Qwen3-VL. This makes it harder to compare performance of different VLMs on vision-language tasks.
Describe the solution you'd like
According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.
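To make the three-module layout concrete, here is a minimal numpy sketch of the forward path (encoder → merger → LLM). All names, shapes, and projections are placeholder assumptions for illustration, not the actual Qwen3-VL or Tunix API:

```python
import numpy as np

# Hypothetical shapes/names; a minimal sketch of the three-module layout.
def vision_encoder(pixels):
    """Stand-in for the SigLIP2 vision encoder: patchify + project."""
    patches = pixels.reshape(-1, 16)     # fake patchify -> (num_patches, 16)
    return patches @ np.ones((16, 8))    # project to a toy vision dim of 8

def vl_merger(vis_feats):
    """Stand-in for the MLP vision-language merger: vision dim -> LLM dim."""
    w1, w2 = np.ones((8, 8)), np.ones((8, 4))
    return np.tanh(vis_feats @ w1) @ w2

def llm(token_embeds):
    """Stand-in for the language model backbone."""
    return token_embeds.sum(axis=-1)

def qwen3_vl_forward(pixels, text_embeds):
    vis_tokens = vl_merger(vision_encoder(pixels))    # (num_patches, llm_dim)
    seq = np.concatenate([vis_tokens, text_embeds])   # prepend visual tokens
    return llm(seq)
```

The point of the sketch is only the module boundaries: the merger is the one piece that couples the vision encoder to the LLM, so it is where most of the integration work would land.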
It's also worth mentioning:
- Interleaved MRoPE, which we don't seem to support yet.
- DeepStack for the vision-language merger, which is a bit more involved than what we did for Gemma 3.
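For the first point, a toy sketch of what "interleaved" means for MRoPE: instead of assigning contiguous chunks of rotary frequencies to the temporal/height/width axes, the axes are round-robined across frequencies so each axis covers both high and low frequencies. The function below is a conceptual illustration with assumed names and shapes, not the actual implementation:

```python
import numpy as np

def interleaved_mrope_freqs(positions, head_dim=8, base=10000.0):
    """Toy sketch of interleaved MRoPE (names/shapes are assumptions).

    positions: (seq, 3) array of (t, h, w) coordinates per token.
    Classic MRoPE gives each axis a contiguous block of rotary
    frequencies; the interleaved variant cycles t/h/w across the
    frequency dims instead.
    """
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) / half)  # (half,) decaying freqs
    axis = np.arange(half) % 3                    # 0=t, 1=h, 2=w, interleaved
    pos = positions[:, axis]                      # (seq, half) per-dim coords
    angles = pos * inv_freq                       # (seq, half) rotation angles
    return np.cos(angles), np.sin(angles)
```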
Additional context
A couple of design questions:
Should we extend the (text-only) Qwen3 model or create an entirely new one? I'm not sure DeepStack can be integrated without changing the text-only implementation.
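To illustrate why DeepStack touches the text-only decoder: it injects ViT features from several depths into the hidden states of the first few LLM layers, so the decoder loop itself has to change. A toy sketch under assumed names and shapes (not the real Qwen3-VL or Tunix code):

```python
import numpy as np

def decoder_layer(h, scale=1.0):
    """Stand-in for a transformer block (toy identity-like op)."""
    return h * scale

def forward_with_deepstack(h, vit_levels, vis_mask, num_layers=4):
    """Toy DeepStack-style forward pass (all names are assumptions).

    h:          (seq, dim) initial token embeddings.
    vit_levels: list of (num_vis, dim) features from different ViT depths.
    vis_mask:   (seq,) bool mask marking visual-token positions.

    Level k is added to the hidden state entering layer k, so
    multi-scale visual detail reaches the early LLM layers -- which is
    why this can't be bolted on without modifying the decoder loop.
    """
    for k in range(num_layers):
        if k < len(vit_levels):
            h = h.copy()
            h[vis_mask] += vit_levels[k]  # inject level-k visual features
        h = decoder_layer(h)
    return h
```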
Checklist
I have searched the existing issues for similar feature requests.
This is not a support question (please use the "bug template" for that).