Qwen3-VL #1063

@ridcl

Description

Is your feature request related to a problem? Please describe.

Currently, Tunix supports the text-only Qwen3, but not the multimodal Qwen3-VL. This makes it harder to compare the performance of different VLMs on vision-language tasks.

Describe the solution you'd like

According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.
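To make the merger piece concrete, here is a minimal numpy sketch of a vision-language merger feeding an LLM (all names, dims, and the ReLU are my placeholders; per the report the real merger is MLP-based, and it likely also merges spatially adjacent patch tokens):

```python
import numpy as np

def mlp_merger(vision_feats, w1, b1, w2, b2):
    """Project vision-encoder features into the LLM embedding space.

    Two-layer MLP sketch; activation and dims are placeholders.
    """
    h = np.maximum(vision_feats @ w1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ w2 + b2

# toy dims (hypothetical): 16 patch tokens, vision dim 32, LLM dim 64
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(16, 32))
w1, b1 = rng.normal(size=(32, 64)), np.zeros(64)
w2, b2 = rng.normal(size=(64, 64)), np.zeros(64)

text_embeds = rng.normal(size=(8, 64))  # 8 text tokens
vision_embeds = mlp_merger(vision_feats, w1, b1, w2, b2)

# vision tokens are projected, then concatenated with text tokens
llm_inputs = np.concatenate([vision_embeds, text_embeds], axis=0)
print(llm_inputs.shape)  # (24, 64)
```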

It's also worth mentioning:

  • Interleaved MRoPE, which we don't seem to support yet
  • DeepStack for the vision-language merger, which is more involved than the merger we have for Gemma 3
  • Video timestamp handling
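For the first bullet, my reading of interleaved MRoPE is that instead of splitting the rotary frequencies into contiguous temporal/height/width blocks (as in earlier MRoPE), the three axes are assigned frequencies round-robin, so each axis covers both low and high frequencies. A hedged sketch (function name and layout are my assumptions, not the report's exact formulation):

```python
import numpy as np

def interleaved_mrope_angles(t, h, w, head_dim, base=10000.0):
    """Rotary angles for one token at position (t, h, w).

    Sketch of the interleaved variant: frequencies are assigned to
    the t/h/w axes in a t, h, w, t, h, w, ... pattern, rather than
    as three contiguous frequency blocks.
    """
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
    pos = np.empty(half)
    pos[0::3] = t  # temporal axis takes every 3rd frequency
    pos[1::3] = h  # height
    pos[2::3] = w  # width
    return pos * inv_freq  # later turned into cos/sin rotations

angles = interleaved_mrope_angles(t=2, h=3, w=5, head_dim=24)
print(angles.shape)  # (12,)
```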

Additional context

A couple of design questions:

  1. Should we wait for #511 to be merged, or start the work on Qwen3-VL in parallel? Calling @abheesht17 for an opinion.
  2. Should we extend the (text-only) Qwen3 model or create a totally new one? I'm not sure it will be easy to integrate DeepStack without changing the text-only implementation.
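To illustrate why question 2 is tricky: DeepStack (as I understand it) injects features from several ViT levels into the residual stream of the first few decoder layers, at the vision-token positions, rather than feeding vision tokens only at the LLM input. That means the text-only decoder loop would need a hook per layer. A hypothetical sketch (names and shapes are mine, not Tunix's):

```python
import numpy as np

def deepstack_inject(hidden, vision_positions, level_feats, layer_idx):
    """Hypothetical DeepStack-style injection, not an existing API.

    Adds ViT level-k features to the hidden states of decoder layer k
    (for the first len(level_feats) layers), at vision-token slots.
    """
    if layer_idx < len(level_feats):
        hidden = hidden.copy()
        hidden[vision_positions] += level_feats[layer_idx]
    return hidden

# toy run: 10-token sequence, first 3 are vision tokens, hidden dim 8
rng = np.random.default_rng(0)
hidden = np.zeros((10, 8))
vision_positions = np.array([0, 1, 2])
level_feats = [rng.normal(size=(3, 8)) for _ in range(2)]  # 2 ViT levels

for layer_idx in range(4):  # pretend 4 decoder layers
    hidden = deepstack_inject(hidden, vision_positions, level_feats, layer_idx)
    # ... layer `layer_idx`'s attention + MLP would run here ...
```

Only the first two layers receive an injection here; the per-layer hook is exactly the part that doesn't exist in a text-only decoder, which is why extending Qwen3 in place may be awkward.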

Checklist

  • I have searched the existing issues for similar feature requests.
  • This is not a support question (please use the "bug template" for that).
