[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe #20167

Status: Open · wants to merge 4 commits into base: main

Conversation


@minosfuture minosfuture commented Jun 27, 2025

Purpose

#19667 changed the workspace creation from torch.zeros to torch.empty. This ends up causing correctness issues for models using cutlass_moe, e.g. Maverick in our test case. This PR fixes the issue by explicitly zero-filling the workspace in cutlass_moe.
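To illustrate the failure mode, here is a minimal pure-Python sketch (not the vLLM code; all names are stand-ins): with expert parallelism, tokens routed to experts not hosted on this rank never get their workspace rows written, so whatever the allocator left behind in a `torch.empty` buffer leaks into the final reduction.

```python
# Pure-Python sketch of why an un-zeroed, reused workspace corrupts
# MoE output. Names (run_layer, local_experts) are illustrative only.

def run_layer(workspace, token_experts, local_experts, zero_fill):
    if zero_fill:                         # the fix: clear stale contents
        for i in range(len(workspace)):
            workspace[i] = 0.0
    for i, expert in enumerate(token_experts):
        if expert in local_experts:       # only local experts write rows
            workspace[i] = float(expert)  # stand-in for the GEMM output
    return sum(workspace)                 # stand-in for the reduction

stale = [99.0, 99.0, 99.0, 99.0]          # leftover data in the buffer
token_experts = [0, 1, 2, 3]              # token i routed to expert i
local_experts = {0, 1}                    # this rank hosts experts 0, 1

bad = run_layer(list(stale), token_experts, local_experts, zero_fill=False)
good = run_layer(list(stale), token_experts, local_experts, zero_fill=True)
print(bad, good)  # 199.0 1.0 — stale rows inflate the un-zeroed result
```

With `torch.zeros` the stale rows were masked by the allocation itself; with `torch.empty` the kernel must clear them explicitly.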

Test Plan

  • lm_eval
  • added a unit test that fails without this fix

Test Result

lm_eval results:

local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8081/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter           | n-shot | Metric      | Value |   Stderr |
|-------|---------|------------------|--------|-------------|-------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.935 | ± 0.0175 |
|       |         | strict-match     | 5      | exact_match | 0.920 | ± 0.0192 |

unit test stability verified:

  • Without `c1.fill_(0)`, the following one-liner verifies stable failure:

```shell
for i in {1..10}; do echo $i; pytest -s tests/kernels/moe/test_cutlass_moe.py  -k "test_run_cutlass_moe_fp8 or test_cutlass_moe_8_bit_EP_large" -v  2>&1 > /dev/null && { echo "shouldn't succeed"; exit 1; } done
```

  • With `c1.fill_(0)`, the following verifies stable success:

```shell
for i in {1..10}; do echo $i; pytest -s tests/kernels/moe/test_cutlass_moe.py  -k "test_run_cutlass_moe_fp8 or test_cutlass_moe_8_bit_EP_large" -v  2>&1 > /dev/null || { echo "should succeed"; exit 1; } done
```

(Optional) Documentation Update

Signed-off-by: Ming Yang <yming@meta.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @minosfuture, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical correctness bug in the cutlass_moe kernel, which was caused by uninitialized memory in the workspace after a recent change from torch.zeros to torch.empty. My fix ensures that the relevant workspace is explicitly zero-filled, restoring correct behavior for models utilizing this kernel, such as Maverick. I have also added comprehensive unit tests to prevent future regressions and expand test coverage for MoE operations.

Highlights

  • Bug Fix: MoE Correctness: I have addressed a correctness issue in cutlass_moe by explicitly filling the c1 workspace (derived from workspace13) with zeros. This resolves a problem where torch.empty was used for workspace creation, leading to uninitialized memory causing incorrect results for models like Maverick.
  • New Unit Test for Workspace Initialization: I have added a new unit test, test_run_cutlass_moe_fp8, which specifically verifies that the run_cutlass_moe_fp8 function produces identical results whether its internal workspace is initialized with random values or explicitly filled with zeros. This test directly validates the bug fix and ensures future regressions are caught.
  • Expanded Test Coverage: I have introduced test_cutlass_moe_8_bit_EP_large, a new parameterized test case for larger 8-bit MoE configurations, to further enhance the test suite's coverage and robustness.
  • Minor API Correction: I have corrected the argument order for the activation_callable lambda function within the CutlassMoe.apply method to ensure proper function signature matching.
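The regression-test idea described above can be sketched in pure Python (assumed names, not the actual test): run the kernel twice, once with the workspace pre-filled with garbage and once zero-filled, and require identical outputs. A kernel that reads uninitialized memory fails this check; one that clears its own scratch space passes.

```python
# Sketch of the workspace-independence test pattern. kernel_with_fix is
# a stand-in for run_cutlass_moe_fp8 after the fix: it zero-fills its
# scratch space before accumulating into it.

def kernel_with_fix(workspace, inputs):
    for i in range(len(workspace)):
        workspace[i] = 0.0            # the explicit zero-fill under test
    for i, x in enumerate(inputs):
        workspace[i] += x             # stand-in for the real computation
    return sum(workspace)

inputs = [1.0, 2.0, 3.0]
garbage_ws = [7.0, -4.0, 0.5, 123.0]  # "torch.empty"-like contents
zeroed_ws = [0.0] * 4                 # "torch.zeros"-like contents

out_a = kernel_with_fix(garbage_ws, inputs)
out_b = kernel_with_fix(zeroed_ws, inputs)
assert out_a == out_b  # identical regardless of initial workspace state
```

Dropping the zero-fill inside the kernel makes the two outputs diverge, which is exactly how the new unit test catches a regression.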

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a correctness issue in cutlass_moe by filling zero to cache space. The added unit test validates the fix. I have a suggestion to improve the efficiency of the new unit test.

Signed-off-by: Ming Yang <yming@meta.com>
Collaborator

@yeqcharlotte yeqcharlotte left a comment


thanks for the unit test to repro the issue!

cc: @bnellnm to also take a look!

```diff
@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
```
Collaborator


Sounds like this should impact both the chunking and non-chunking paths?

Author


I think the non-chunking (batched) path is not impacted because c1 is fully overwritten there. I don't have solid proof, though (I need to look into that code path more). @bnellnm / @ElizaWszola, comments?

Contributor


I tested this locally and found that the batched case needs to be cleared as well. I think it's probably best to unconditionally zero out c1.

Author


I see, updated. Could you share how to run the batched-case tests? Thanks.

```diff
@@ -365,3 +369,131 @@ def test_cutlass_moe_8_bit_EP(
                            cutlass_output,
                            atol=5e-2,
                            rtol=1e-2)


+@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
```
Collaborator


let's have some m > 32k

Author


added

```python
from vllm.model_executor.layers.fused_moe.fused_moe import (fused_experts,
                                                            fused_topk)
from vllm.model_executor.layers.fused_moe.utils import (
    moe_kernel_quantize_input)
from vllm.platforms import current_platform

NUM_EXPERTS = [40, 64]
```
Collaborator


does it help with the working sets at line 38-39?

Author


Unfortunately, no. We can look into this more separately.

```diff
@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
```
Contributor


Is this needed when we don't use expert_map? If it's not, can you add a condition for this?

Author


Thanks, updated!
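The conditional form discussed in this thread could look like the following hypothetical sketch (names assumed; in vLLM, expert_map is a tensor and c1 is a torch view, neither of which is modeled here). Zeroing is only needed when an expert_map is present, since that is when some rows of c1 may be left unwritten.

```python
# Hypothetical sketch of the conditional zero-fill: clear the buffer
# only when expert parallelism (an expert_map) is in use.

def maybe_zero_c1(c1_rows, expert_map):
    if expert_map is not None:
        for i in range(len(c1_rows)):
            c1_rows[i] = 0.0
    return c1_rows

untouched = maybe_zero_c1([5.0, 5.0], expert_map=None)  # no EP: skip fill
cleared = maybe_zero_c1([5.0, 5.0], expert_map={2: 0})  # EP: zero-fill
print(untouched, cleared)  # [5.0, 5.0] [0.0, 0.0]
```

Skipping the fill in the non-EP case avoids an extra memset on the hot path where every row is guaranteed to be overwritten anyway.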

```python
    per_out_channel: bool,
    ep_size: int,
):
    current_platform.seed_everything(7)
```
Contributor


If the body of this is the same as test_cutlass_moe_8_bit_EP can you factor it out into a common function?

Author


Good point, thanks! Updated, refactoring a few similar test functions here.

Signed-off-by: Ming Yang <yming@meta.com>
```python
):
    current_platform.seed_everything(7)
    monkeypatch.setenv("VLLM_FUSED_MOE_CHUNK_SIZE", "8192")
```
Author


This is not very git-friendly, but note that this line is not removed during the refactoring; see test_cutlass_moe_8_bit_no_graph.

Signed-off-by: Ming Yang <yming@meta.com>
4 participants