Optimized fused MoE Kernel #2913
Hi @pcmoritz, thanks for the amazing PR! Is this PR ready for review, or is anything still blocking it?
I think we should merge your kernel (https://github.com/vllm-project/vllm/tree/cutlass-moe) as a separate PR first, and then we can merge this one. If you open a PR for the TensorRT kernels, I'm happy to review it! What I'm currently unsure about is whether we should have two different kernels for the two regimes; that seems very unfortunate to me. I'll look a little more into whether we can get more out of the Triton kernel in the low batch size regime and will keep you updated. Let's come to a conclusion before the end of this week and execute on it :) I'm also curious about your thoughts on stitching together two kernels.
Closed in favor of #2979
This PR is based on @WoosukKwon's excellent work in porting the TensorRT MoE kernels in https://github.com/vllm-project/vllm/tree/cutlass-moe
It is based on the observation that the TensorRT MoE kernels work very well in the small batch size regime, whereas the fused MoE kernel works much better in the large batch size regime. I have been trying to optimize the Triton kernels for small batch sizes too, but unfortunately Triton doesn't seem to have great support for matrix multiplications that involve skinny matrices (e.g. tl.dot only supports dimensions >= 16). Therefore, we use the TensorRT kernel in the small batch size regime and the fused MoE kernel in the large batch size regime. It would be much preferable to have one unified kernel for all regimes, so if anybody knows how to make that happen, I'd love to know.
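To make the dispatch described above concrete, here is a minimal sketch, not the code in this PR. The cutoff value, the function names, and the idea of rounding the token dimension up to a multiple of 16 are all assumptions for illustration; the PR text only says that small batches go to the TensorRT kernel and large batches go to the fused MoE kernel.

```python
from typing import Callable

# Hypothetical cutoff: the PR description does not state where the switch happens.
SMALL_BATCH_THRESHOLD = 16


def select_moe_kernel(
    num_tokens: int,
    small_batch_kernel: Callable,
    large_batch_kernel: Callable,
    threshold: int = SMALL_BATCH_THRESHOLD,
) -> Callable:
    """Pick a kernel by batch size: small batches use one kernel, large ones the other."""
    if num_tokens < threshold:
        return small_batch_kernel
    return large_batch_kernel


def padded_rows(num_tokens: int, min_dim: int = 16) -> int:
    """Round the token dimension up to a multiple of min_dim.

    Simplifying assumption: a tile-based matmul with a minimum tile dimension
    of 16 has to pad a skinny input up to full tiles.
    """
    return -(-num_tokens // min_dim) * min_dim


def wasted_fraction(num_tokens: int, min_dim: int = 16) -> float:
    """Fraction of padded rows that carry no real tokens."""
    padded = padded_rows(num_tokens, min_dim)
    return (padded - num_tokens) / padded
```

Under these assumptions a single-token batch is padded to 16 rows, so 15 of the 16 rows are padding. This is one way to see why a tile-based kernel can struggle with skinny matrices.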
This PR also incorporates some of @cadedaniel's work on autotuning the fused MoE kernel.
The benchmarks are as follows (all on H100 with TP2, using 1000 input and 50 output tokens):
This PR with below tuning configs:
current main branch (untuned fused MoE kernel):
only using the TensorRT MoE kernels:
You can run the autotuned kernel by setting
where fused_moe_h100_tp2_config.json contains the following: