[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe #20167
Conversation
Signed-off-by: Ming Yang <yming@meta.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR or enable auto-merge. 🚀
Summary of Changes
Hello @minosfuture, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves a critical correctness bug in the cutlass_moe kernel, caused by uninitialized memory in the workspace after a recent change from torch.zeros to torch.empty. My fix ensures the relevant workspace is explicitly zero-filled, restoring correct behavior for models that use this kernel, such as Maverick. I have also added unit tests to prevent future regressions and expand test coverage for MoE operations.
Highlights
- Bug Fix: MoE Correctness: I have addressed a correctness issue in cutlass_moe by explicitly filling the c1 workspace (derived from workspace13) with zeros. This resolves a problem where torch.empty was used for workspace creation, leaving uninitialized memory that caused incorrect results for models like Maverick.
- New Unit Test for Workspace Initialization: I have added a new unit test, test_run_cutlass_moe_fp8, which verifies that run_cutlass_moe_fp8 produces identical results whether its internal workspace is initialized with random values or explicitly filled with zeros. This test directly validates the bug fix and ensures future regressions are caught.
- Expanded Test Coverage: I have introduced test_cutlass_moe_8_bit_EP_large, a new parameterized test case for larger 8-bit MoE configurations, to further enhance the test suite's coverage and robustness.
- Minor API Correction: I have corrected the argument order of the activation_callable lambda within the CutlassMoe.apply method to ensure proper function-signature matching.
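For intuition, here is a minimal sketch of the failure mode (a simplified stand-in, not the vLLM kernel code): a torch.empty workspace that is only partially written leaks indeterminate values into any reduction over the full buffer, and a zero-fill restores determinism.

import torch

# Minimal sketch of the failure mode; not the vLLM kernel itself.
# torch.empty returns whatever bytes happen to be in memory, so if a
# kernel writes only the rows belonging to active experts, the untouched
# rows keep that garbage and poison reductions over the full buffer.
M, topk, N = 4, 2, 8
workspace = torch.empty(M * topk, N)   # uninitialized, as after #19667

active_rows = torch.tensor([0, 2, 5])  # rows a grouped GEMM actually writes
workspace[active_rows] = 1.0           # partial write; other rows are garbage
unsafe = workspace.sum()               # nondeterministic result

workspace.fill_(0)                     # the fix: zero-fill first
workspace[active_rows] = 1.0
safe = workspace.sum()                 # deterministic: 3 * N = 24
print(unsafe.item(), safe.item())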
Code Review
This pull request fixes a correctness issue in cutlass_moe by zero-filling the cache space. The added unit test validates the fix. I have a suggestion to improve the efficiency of the new unit test.
Signed-off-by: Ming Yang <yming@meta.com>
Force-pushed from 66c457b to 25d3af8.
thanks for the unit test to repro the issue!
cc: @bnellnm to also take a look!
@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
sounds like this should impact both the chunking and non-chunking paths?
I think the non-chunking (batched) path is not impacted because c1 is fully overwritten. I don't have solid proof, though (I need to look into that code path more). @bnellnm / @ElizaWszola, comments?
I tested this locally and found that the batched case needs to be cleared also. I think it's probably best to unconditionally zero out c1
I see, updated. Could you share how to run the batched-case tests? thx.
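One plausible way stale data surfaces here: c1 and c3 are both views carved out of the same workspace13 buffer, so in a chunked run the previous chunk's c3 output can still be visible through the next chunk's c1. A minimal aliasing sketch, using plain slicing as a stand-in for _resize_cache:

import torch

# c1 and c3 alias the same underlying buffer, so data written through c3
# on one chunk reads back through c1 on the next chunk unless cleared.
workspace13 = torch.empty(1024)
c1 = workspace13[: 2 * 256].view(2, 256)  # like (M * topk, N * 2)
c3 = workspace13[: 2 * 128].view(2, 128)  # like (M * topk, K); overlaps c1
c3.fill_(7.0)       # a previous chunk's c3 output...
print(c1[0, :4])    # ...is visible through c1 as stale data
c1.fill_(0)         # the fix clears it before reuse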
@@ -365,3 +369,131 @@ def test_cutlass_moe_8_bit_EP(
                           cutlass_output,
                           atol=5e-2,
                           rtol=1e-2)
+
+@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
let's have some m > 32k
added
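For reference, a parametrization covering that request might look like the following sketch; the exact values that landed in the PR may differ.

import pytest

# Hypothetical parametrization adding an m > 32k case per the review
# request; shapes are illustrative, not necessarily what was merged.
@pytest.mark.parametrize("m,n,k,topk", [
    (1, 8192, 5120, 31),      # original small-batch case
    (33792, 8192, 5120, 31),  # m > 32k to exercise the chunked path
])
def test_cutlass_moe_8_bit_EP_large(m, n, k, topk):
    ...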
from vllm.model_executor.layers.fused_moe.fused_moe import (fused_experts,
                                                            fused_topk)
from vllm.model_executor.layers.fused_moe.utils import (
    moe_kernel_quantize_input)
from vllm.platforms import current_platform

NUM_EXPERTS = [40, 64]
does it help with the working sets at line 38-39?
unfortunately no. We can look into this more separately.
@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
Is this needed when we don't use expert_map? In case it's not, can you write a condition for this?
thanks. updated!
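The guard presumably takes roughly this shape (a sketch; the exact condition in the final diff is not shown on this page):

# Zero-fill only when expert_map is in use, i.e. when some rows of c1
# may never be written by the grouped GEMM.
if expert_map is not None:
    c1.fill_(0)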
    per_out_channel: bool,
    ep_size: int,
):
    current_platform.seed_everything(7)
If the body of this is the same as test_cutlass_moe_8_bit_EP, can you factor it out into a common function?
Good point! thx. Updated with refactoring a few similar test functions here.
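A refactor of that shape typically moves the shared body into one helper and keeps each test as a thin parametrized wrapper; the sketch below uses hypothetical names and parameter values.

import pytest
from vllm.platforms import current_platform

# Hypothetical shape of the refactor; names and values are illustrative.
def _run_cutlass_moe_8_bit(m, n, k, topk, per_out_channel, ep_size):
    current_platform.seed_everything(7)
    ...  # build inputs, run the cutlass MoE path, compare to a reference

@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
@pytest.mark.parametrize("per_out_channel", [True, False])
@pytest.mark.parametrize("ep_size", [2, 4])
def test_cutlass_moe_8_bit_EP_large(m, n, k, topk, per_out_channel, ep_size):
    _run_cutlass_moe_8_bit(m, n, k, topk, per_out_channel, ep_size)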
Signed-off-by: Ming Yang <yming@meta.com>
):
    current_platform.seed_everything(7)
    monkeypatch.setenv("VLLM_FUSED_MOE_CHUNK_SIZE", "8192")
this is not very git-friendly, but note this line is not removed during refactoring. see test_cutlass_moe_8_bit_no_graph
Signed-off-by: Ming Yang <yming@meta.com>
Purpose
#19667 changed the workspace creation from torch.zeros to torch.empty. This ended up causing correctness issues for models using cutlass_moe, e.g. Maverick in our test case. This PR fixes the correctness issue by explicitly filling zeros in cutlass_moe.

Test Plan
Test Result
lm_eval results:
local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8081/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1
Unit test stability verified:
- Without c1.fill_(0), a one-liner repeat run verifies stable failure.
- With c1.fill_(0), the same run verifies stable success.
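A repeat-runner of roughly this shape is one way to perform such a stability check; the test node id below is hypothetical, not necessarily the exact path in the repo.

import pytest

# Hypothetical stability check: run the new test repeatedly, stop on the
# first failure. The node id is illustrative.
for i in range(10):
    rc = pytest.main(
        ["-x", "tests/kernels/moe/test_cutlass_moe.py::test_run_cutlass_moe_fp8"])
    assert rc == 0, f"failed on iteration {i}"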