Enable Qwen3-VL-MoE model quantization, saving, and vLLM loading #942

@WeiweiZhang1

Description

For the Qwen3-VL-MoE models (e.g., Qwen/Qwen3-VL-30B-A3B-Instruct), the fused MoE architecture (similar to LLaMA 4 and GPT-OSS) requires additional support for quantization.

Since the MoE experts account for the majority of the model's parameters, supporting them is what determines the achievable compression ratio.

To ensure compatibility, the quantized expert weights should be saved in the tensor shapes that vLLM expects, so the model loads correctly.
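
To illustrate the flow (not this project's actual implementation), here is a minimal PyTorch sketch of unpacking a fused expert tensor into per-expert 2D matrices, quantizing each, and re-stacking with the expert dimension outermost. The tensor name `gate_up_proj`, the shapes, and `quantize_weight_int4` are illustrative assumptions, not the real Qwen3-VL-MoE checkpoint layout or quantization scheme.

```python
# Sketch only: names, shapes, and the toy int4 scheme are illustrative assumptions.
import torch

def quantize_weight_int4(w: torch.Tensor, group_size: int = 128):
    """Toy symmetric per-group int4 quantization, used only to illustrate the flow."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = (w_groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

def quantize_fused_experts(gate_up_proj: torch.Tensor):
    """Split a fused [num_experts, hidden, 2*intermediate] tensor into per-expert
    2D matrices, quantize each, and re-stack with the expert dim outermost."""
    per_expert = torch.unbind(gate_up_proj, dim=0)  # num_experts tensors of [hidden, 2*inter]
    qs, scales = zip(*(quantize_weight_int4(w.T.contiguous()) for w in per_expert))
    return torch.stack(qs, dim=0), torch.stack(scales, dim=0)

# Illustrative sizes only.
num_experts, hidden_size, intermediate_size = 8, 256, 512
fused = torch.randn(num_experts, hidden_size, 2 * intermediate_size)
qweight, qscales = quantize_fused_experts(fused)
print(qweight.shape, qscales.shape)  # [8, 1024, 256] and [8, 1024, 2]
```

The point of keeping the expert dimension outermost after re-stacking is that fused-MoE kernels generally index weights by expert first; the exact packed layout vLLM expects for AWQ experts should be taken from the reference checkpoint above.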

Reference quantized model and example:

Model: QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ

Example: vllm-project/llm-compressor, qwen3-vl-30b-a3b-Instruct-example.py
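
For reference, loading an AWQ checkpoint like the one above in vLLM typically looks like the sketch below; the prompt and sampling settings are arbitrary, and exact multimodal settings for Qwen3-VL may differ.

```python
# Sketch of loading the quantized checkpoint with vLLM; flags are standard vLLM options.
from vllm import LLM, SamplingParams

llm = LLM(
    model="QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    quantization="awq",      # use vLLM's AWQ kernels
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Describe what a mixture-of-experts layer does."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```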
