Enable Qwen3-VL-MoE model quantization, saving, and vLLM loading #942

@WeiweiZhang1

Description

For the Qwen3-VL-MoE models (e.g., Qwen/Qwen3-VL-30B-A3B-Instruct), the fused MoE architecture (similar to LLaMA 4 and GPT-OSS) requires additional support for quantization.

Since the MoE experts account for the majority of the model's parameters, supporting them is what determines the achievable compression ratio.

To ensure compatibility, the quantized expert weights should be saved in the tensor shapes that vLLM expects, so the model loads correctly.
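
To illustrate the flow (not this project's actual implementation), here is a minimal PyTorch sketch of unpacking a fused expert tensor into per-expert 2D matrices, quantizing each, and re-stacking with the expert dimension outermost. The tensor name `gate_up_proj`, the shapes, and `quantize_weight_int4` are illustrative assumptions, not the real Qwen3-VL-MoE checkpoint layout or quantization scheme.

```python
# Sketch only: names, shapes, and the toy int4 scheme are illustrative assumptions.
import torch

def quantize_weight_int4(w: torch.Tensor, group_size: int = 128):
    """Toy symmetric per-group int4 quantization, used only to illustrate the flow."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = (w_groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

def quantize_fused_experts(gate_up_proj: torch.Tensor):
    """Split a fused [num_experts, hidden, 2*intermediate] tensor into per-expert
    2D matrices, quantize each, and re-stack with the expert dim outermost."""
    per_expert = torch.unbind(gate_up_proj, dim=0)  # num_experts tensors of [hidden, 2*inter]
    qs, scales = zip(*(quantize_weight_int4(w.T.contiguous()) for w in per_expert))
    return torch.stack(qs, dim=0), torch.stack(scales, dim=0)

# Illustrative sizes only.
num_experts, hidden_size, intermediate_size = 8, 256, 512
fused = torch.randn(num_experts, hidden_size, 2 * intermediate_size)
qweight, qscales = quantize_fused_experts(fused)
print(qweight.shape, qscales.shape)  # [8, 1024, 256] and [8, 1024, 2]
```

The point of keeping the expert dimension outermost after re-stacking is that fused-MoE kernels generally index weights by expert first; the exact packed layout vLLM expects for AWQ experts should be taken from the reference checkpoint above.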

Reference quantized model and example:

Model: QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ

Example: vllm-project/llm-compressor, qwen3-vl-30b-a3b-Instruct-example.py
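
For reference, loading an AWQ checkpoint like the one above in vLLM typically looks like the sketch below; the prompt and sampling settings are arbitrary, and exact multimodal settings for Qwen3-VL may differ.

```python
# Sketch of loading the quantized checkpoint with vLLM; flags are standard vLLM options.
from vllm import LLM, SamplingParams

llm = LLM(
    model="QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    quantization="awq",      # use vLLM's AWQ kernels
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Describe what a mixture-of-experts layer does."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```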
