Open
Description
Proposal to improve performance
By reading the relevant parts of the source code and running some tests, we found that when launching a MoE model like Qwen3, vLLM seems to use the Triton-based fused MoE kernel, while other implementations such as CUTLASS or DeepGEMM are only supported on specific GPU architectures (e.g. Hopper) or with specific quantization methods (e.g. compressed-tensors).
Is there a way to specify which fused MoE kernel implementation to use? For example, I might want to compare the performance of the Triton-based and CUTLASS-based implementations on my A100 GPUs.
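For context, here is a minimal, library-agnostic timing harness of the kind such a comparison would need. The `kernel_a` / `kernel_b` callables are hypothetical stand-ins for the real fused MoE kernels, and the harness itself is only a sketch, not vLLM's benchmarking API:

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20):
    """Time a callable: run a few warmup passes, then report the median latency in seconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-ins for two kernel implementations; both must
# produce the same result so the comparison is apples-to-apples.
def kernel_a(x):
    return [v * 2 for v in x]


def kernel_b(x):
    return [v + v for v in x]


data = list(range(1024))
t_a = benchmark(kernel_a, data)
t_b = benchmark(kernel_b, data)
print(f"kernel_a: {t_a * 1e6:.1f} us, kernel_b: {t_b * 1e6:.1f} us")
```

Note that for actual GPU kernels a harness like this would also need to synchronize the device (e.g. `torch.cuda.synchronize()`) around each timed call, since CUDA launches are asynchronous.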
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.