Description
vLLM v0.8.4 and higher natively supports all Qwen3 and Qwen3MoE models. Example command:

- `vllm serve Qwen/... --enable-reasoning --reasoning-parser deepseek_r1`
- All models should work with the command above. You can test the reasoning parser with the following example script: https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py (a minimal non-streaming client sketch is also included after this list).
- The weights of some MoE models might not be evenly divisible at TP 8. Either lower your TP size or use `--enable-expert-parallel` (an example launch command is given after this list).
- If you are seeing the following error when running fp8 dense models, you are running on vLLM v0.8.4. Please upgrade to v0.8.5.

  ```
  File ".../vllm/model_executor/parameter.py", line 149, in load_qkv_weight
    param_data = param_data.narrow(self.output_dim, shard_offset,
  IndexError: start out of range (expected to be in range of [-18, 18], but got 2048)
  ```
- If you are seeing the following error when running MoE models with fp8, your tensor parallel degree is too high and the sharded weights are no longer divisible by the fp8 quantization block size. Consider `--tensor-parallel-size 4` or `--tensor-parallel-size 8 --enable-expert-parallel`.

  ```
  File ".../vllm/vllm/model_executor/layers/quantization/fp8.py", line 477, in create_weights
    raise ValueError(
  ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
  ```
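To make the divisibility constraint concrete: the `192` in the error above is the per-GPU shard of each expert's gate/up projection. If the failing run used TP 8, the full `moe_intermediate_size` works out to 192 × 8 = 1536; sharding it 8 ways gives 192 columns per GPU, which is not a multiple of the fp8 quantization block size 128, while TP 4 gives 1536 / 4 = 384 = 3 × 128, which is. Roughly speaking, `--enable-expert-parallel` places whole experts on each GPU instead of slicing every expert's weights across GPUs, so the block-size constraint no longer bites at TP 8.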
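As a concrete example of the flags mentioned above, an fp8 MoE checkpoint on an 8-GPU node could be launched along the lines of `vllm serve Qwen/... --tensor-parallel-size 8 --enable-expert-parallel --enable-reasoning --reasoning-parser deepseek_r1` (the model path is the same placeholder as in the command at the top; substitute your checkpoint).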
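Once a server is up, you can also sanity-check the reasoning parser without streaming. The sketch below is a minimal example against vLLM's OpenAI-compatible API; the port (8000), the placeholder model name, and the exact location of the parsed reasoning (`reasoning_content` on the message) are assumptions based on default server settings and the example script linked above.

```python
# Minimal non-streaming sanity check for the reasoning parser.
# Assumes `pip install openai` and a server started with the
# `vllm serve ... --enable-reasoning --reasoning-parser deepseek_r1`
# command shown above, listening on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "Qwen/Qwen3-8B" is an illustrative placeholder; use the model name
# you actually passed to `vllm serve`.
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is 9.11 minus 9.8?"}],
)

message = response.choices[0].message
# With the reasoning parser enabled, the chain of thought should be
# split out into `reasoning_content`, separate from the final answer.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```

The streaming script linked above exercises the same parser, just delivering the reasoning incrementally via delta chunks instead.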