permute/unpermute kernel for moe optimization #14568
Conversation
Is this kernel preparation for the group_gemm work coming later?
yes
When do you expect it to be supported?
I'm not sure about the exact release timing either.
Force-pushed from 8d6c8fb to 076bee0
update
This PR submits 2 kernels.
Hi @CalebDu, thanks for working on this. I think this might help me out with #13932. I am currently trying to use your PR to do the permute/unpermute steps needed for the DeepGemm grouped gemm kernel. I did run into a test failure with certain problem sizes that are not in the original test, e.g.
Do you have any idea where the problem might be?
I'll try this test case, figure out why it mismatches, and fix it soon.
@bnellnm this bug was caused by the workspace being too small for the CUB radix sort. After expanding the workspace, the mismatch is fixed.
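The failure mode described here comes from CUB's two-phase workspace convention: the caller is supposed to query the required temporary-storage size first, then allocate at least that much before the real call. A minimal Python analogue (illustrative only — `radix_sort_pairs` and the size formula are hypothetical stand-ins, not the PR's CUDA code or CUB's actual API) sketches the pattern:

```python
def radix_sort_pairs(workspace, keys, values):
    # Hypothetical requirement: room for two ping-pong buffers of 8-byte
    # keys. The real figure comes from CUB's own size query.
    needed = 2 * len(keys) * 8
    if workspace is None:
        return needed  # size query: report bytes needed, sort nothing
    if len(workspace) < needed:
        # The failure mode discussed above: a fixed workspace that is
        # too small for this problem size.
        raise RuntimeError("workspace too small")
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [keys[i] for i in order], [values[i] for i in order]

# Correct usage: query first, then allocate exactly what was reported.
needed = radix_sort_pairs(None, [3, 1, 2], ["c", "a", "b"])
workspace = bytearray(needed)
sorted_keys, sorted_vals = radix_sort_pairs(workspace, [3, 1, 2], ["c", "a", "b"])
```

Allocating a fixed-size workspace instead of performing the query is exactly what breaks once an unanticipated problem size arrives.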
Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
@CalebDu, thanks for adding the blocking support! I've been working on integrating the new version with DeepGemm but I'm running into problems with the
I noticed DeepGemm updated its documentation about
And
Do you have a better idea?
The above code snippet works for DeepGemm (without an explicit expert_map). I'll have to add some tests + code for the user-supplied expert_map. At the moment unused values get chopped off at the end by the inverse map. But it would probably be better if we could chop them off up front like you suggest. I'm not sure of a clean/simple way to do that since they'd need to be chopped off in 128-element-sized chunks (at least for DeepGemm).
I updated the code to fill padding rows with
Cool, I didn't realize you could use the first offset tokens to exclude the unused bits. If that's the case I don't think it matters what goes into m_indices since they will be sliced off anyway. |
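The 128-element-chunk constraint discussed above can be sketched as follows — a minimal, illustrative Python helper (the names `aligned_offsets`, `block`, and `m_indices` are stand-ins, not DeepGemm's or the PR's actual API) that pads each expert's token count up to a block multiple and builds a row-to-expert map:

```python
def aligned_offsets(tokens_per_expert, block=128):
    # Round each expert's token count up to a multiple of `block`, and
    # build an m_indices-style map from each padded row to its expert id.
    # Padding rows still carry the expert id; downstream code can slice
    # them off using the real (unpadded) per-expert offsets.
    padded = [((n + block - 1) // block) * block for n in tokens_per_expert]
    m_indices = [e for e, p in enumerate(padded) for _ in range(p)]
    return padded, m_indices
```

For example, expert counts `[5, 130, 0]` pad to `[128, 256, 0]`; an expert with zero tokens contributes no rows at all, which is one way the unused chunks could be "chopped off up front".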
The implementation looks nice and clean. Are there any benchmark results?
(I did spot one potential overflow that should be addressed)
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
…ices` rather than -1. Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
1. remove unused code 2. move all the non-trivial definitions from moe_permute_unpermute_kernel.h to .cu and .inl 3. some minor update Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
…ll invoking fused_topk code Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
2. fix potential overflow and remove debug cruft with tlrmchlsmth's review 3. add benchmark for performance Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
@tlrmchlsmth I updated the code per your review, and fixed the CI failure when calling FusedMoE.select_experts.
Benchmark on H20:
LGTM, thanks for posting the performance numbers!
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn> Signed-off-by: minpeter <kali2005611@gmail.com>
Looks like this was missed from vllm-project#14568. It caused issues when building vLLM with TORCH_CUDA_ARCH_LIST specified, such as TORCH_CUDA_ARCH_LIST="9.0a". Because we didn't pass CUDA_ARCHS correctly when compiling moe_permute_unpermute_op, we ended up with the following failure.

Without the fix:

```
$ pytest tests/kernels/moe/test_cutlass_moe.py -k test_run_cutlass_moe_fp8[8-True-False-128-1-8192-5120-31]
...
RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain
```

With the fix:

```
$ pytest tests/kernels/moe/test_cutlass_moe.py -k test_run_cutlass_moe_fp8[8-True-False-128-1-8192-5120-31]
...
1 passed, 461 deselected, 2 warnings in 5.00s
```

Signed-off-by: Yang Chen <yangche@fb.com>
The moe_permute kernel expands and reorders the tokens in the activation to gather the non-contiguous tokens for each expert, so that a grouped GEMM can then be called for MoE speedup. The moe_unpermute kernel reduces the expanded grouped-GEMM output and scales it with topk_weight.
This implementation refers to the MoE kernels in TensorRT-LLM, archived at https://github.com/BBuf/tensorrt-llm-moe/tree/master.
Expert-Parallelism with expert_map is currently unsupported; support will follow in later updates.
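The permute/unpermute pair described above can be sketched in plain Python — a minimal, illustrative model of what the CUDA kernels compute (function names and signatures are stand-ins, not the PR's actual API; the real kernels operate on GPU tensors):

```python
def moe_permute(tokens, topk_ids):
    # Expand each token topk times, then stably sort the expanded rows by
    # expert id so every expert's tokens are contiguous for a grouped GEMM.
    expanded = [(e, t) for t, experts in enumerate(topk_ids) for e in experts]
    order = sorted(range(len(expanded)), key=lambda i: expanded[i][0])
    permuted = [tokens[expanded[i][1]] for i in order]
    return permuted, order

def moe_unpermute(permuted_out, order, topk_weights):
    # Scatter rows back to expanded order, then reduce each token's topk
    # grouped-GEMM outputs with its topk weights.
    topk = len(topk_weights[0])
    expanded = [None] * len(order)
    for pos, i in enumerate(order):
        expanded[i] = permuted_out[pos]
    out = []
    for t, weights in enumerate(topk_weights):
        rows = expanded[t * topk:(t + 1) * topk]
        out.append([sum(w * r[d] for w, r in zip(weights, rows))
                    for d in range(len(rows[0]))])
    return out
```

With an identity "expert computation" between the two calls and weights summing to 1, the round trip reproduces the input tokens, which is the invariant the PR's correctness tests rely on.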