[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe #20167
Conversation
Signed-off-by: Ming Yang <yming@meta.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR or enable auto-merge. 🚀
Summary of Changes
Hello @minosfuture, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves a critical correctness bug in the cutlass_moe kernel, caused by uninitialized memory in the workspace after a recent change from torch.zeros to torch.empty. My fix ensures the relevant workspace is explicitly zero-filled, restoring correct behavior for models that use this kernel, such as Maverick. I have also added unit tests to prevent future regressions and expand test coverage for MoE operations.
Highlights
- Bug Fix: MoE Correctness: I have addressed a correctness issue in cutlass_moe by explicitly filling the c1 workspace (derived from workspace13) with zeros. This resolves a problem where torch.empty was used for workspace creation, leaving uninitialized memory that caused incorrect results for models like Maverick.
- New Unit Test for Workspace Initialization: I have added a new unit test, test_run_cutlass_moe_fp8, which verifies that run_cutlass_moe_fp8 produces identical results whether its internal workspace is initialized with random values or explicitly filled with zeros. This test directly validates the bug fix and ensures future regressions are caught.
- Expanded Test Coverage: I have introduced test_cutlass_moe_8_bit_EP_large, a new parameterized test case for larger 8-bit MoE configurations, to further enhance the test suite's coverage and robustness.
- Minor API Correction: I have corrected the argument order of the activation_callable lambda within the CutlassMoe.apply method to ensure proper function-signature matching.
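For intuition, here is a minimal sketch of the failure mode (a simplified stand-in, not the vLLM kernel code): a torch.empty workspace that is only partially written leaks indeterminate values into any reduction over the full buffer, and a zero-fill restores determinism.

import torch

# Minimal sketch of the failure mode; not the vLLM kernel itself.
# torch.empty returns whatever bytes happen to be in memory, so if a
# kernel writes only the rows belonging to active experts, the untouched
# rows keep that garbage and poison reductions over the full buffer.
M, topk, N = 4, 2, 8
workspace = torch.empty(M * topk, N)   # uninitialized, as after #19667

active_rows = torch.tensor([0, 2, 5])  # rows a grouped GEMM actually writes
workspace[active_rows] = 1.0           # partial write; other rows are garbage
unsafe = workspace.sum()               # nondeterministic result

workspace.fill_(0)                     # the fix: zero-fill first
workspace[active_rows] = 1.0
safe = workspace.sum()                 # deterministic: 3 * N = 24
print(unsafe.item(), safe.item())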
Code Review
This pull request fixes a correctness issue in cutlass_moe by zero-filling the cache space. The added unit test validates the fix. I have a suggestion to improve the efficiency of the new unit test.
Signed-off-by: Ming Yang <yming@meta.com>
Force-pushed from 66c457b to 25d3af8.
thanks for the unit test to repro the issue!
cc: @bnellnm to also take a look!
@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
sounds like this should impact both the chunking and non-chunking paths?
I think the non-chunking (batched) path is not impacted because c1 is fully overwritten. I don't have solid proof, though (I need to look into that code path more). @bnellnm / @ElizaWszola, comments?
I tested this locally and found that the batched case needs to be cleared also. I think it's probably best to unconditionally zero out c1
I see, updated. Could you share how to run the batched-case tests? thx.
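One plausible way stale data surfaces here: c1 and c3 are both views carved out of the same workspace13 buffer, so in a chunked run the previous chunk's c3 output can still be visible through the next chunk's c1. A minimal aliasing sketch, using plain slicing as a stand-in for _resize_cache:

import torch

# c1 and c3 alias the same underlying buffer, so data written through c3
# on one chunk reads back through c1 on the next chunk unless cleared.
workspace13 = torch.empty(1024)
c1 = workspace13[: 2 * 256].view(2, 256)  # like (M * topk, N * 2)
c3 = workspace13[: 2 * 128].view(2, 128)  # like (M * topk, K); overlaps c1
c3.fill_(7.0)       # a previous chunk's c3 output...
print(c1[0, :4])    # ...is visible through c1 as stale data
c1.fill_(0)         # the fix clears it before reuse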
@@ -365,3 +369,131 @@ def test_cutlass_moe_8_bit_EP(
                           cutlass_output,
                           atol=5e-2,
                           rtol=1e-2)
+
+@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
let's have some m > 32k
added
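For reference, a parametrization covering that request might look like the following sketch; the exact values that landed in the PR may differ.

import pytest

# Hypothetical parametrization adding an m > 32k case per the review
# request; shapes are illustrative, not necessarily what was merged.
@pytest.mark.parametrize("m,n,k,topk", [
    (1, 8192, 5120, 31),      # original small-batch case
    (33792, 8192, 5120, 31),  # m > 32k to exercise the chunked path
])
def test_cutlass_moe_8_bit_EP_large(m, n, k, topk):
    ...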
from vllm.model_executor.layers.fused_moe.fused_moe import (fused_experts,
                                                            fused_topk)
from vllm.model_executor.layers.fused_moe.utils import (
    moe_kernel_quantize_input)
from vllm.platforms import current_platform

NUM_EXPERTS = [40, 64]
does it help with the working sets at line 38-39?
unfortunately no. We can look into this more separately.
@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
Is this needed when we don't use expert_map? In case it's not, can you write a condition for this?
thanks. updated!
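The guard presumably takes roughly this shape (a sketch; the exact condition in the final diff is not shown on this page):

# Zero-fill only when expert_map is in use, i.e. when some rows of c1
# may never be written by the grouped GEMM.
if expert_map is not None:
    c1.fill_(0)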
    per_out_channel: bool,
    ep_size: int,
):
    current_platform.seed_everything(7)
If the body of this is the same as test_cutlass_moe_8_bit_EP, can you factor it out into a common function?
Good point! thx. Updated with refactoring a few similar test functions here.
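A refactor of that shape typically moves the shared body into one helper and keeps each test as a thin parametrized wrapper; the sketch below uses hypothetical names and parameter values.

import pytest
from vllm.platforms import current_platform

# Hypothetical shape of the refactor; names and values are illustrative.
def _run_cutlass_moe_8_bit(m, n, k, topk, per_out_channel, ep_size):
    current_platform.seed_everything(7)
    ...  # build inputs, run the cutlass MoE path, compare to a reference

@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
@pytest.mark.parametrize("per_out_channel", [True, False])
@pytest.mark.parametrize("ep_size", [2, 4])
def test_cutlass_moe_8_bit_EP_large(m, n, k, topk, per_out_channel, ep_size):
    _run_cutlass_moe_8_bit(m, n, k, topk, per_out_channel, ep_size)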
Signed-off-by: Ming Yang <yming@meta.com>
):
    current_platform.seed_everything(7)
    monkeypatch.setenv("VLLM_FUSED_MOE_CHUNK_SIZE", "8192")
this is not very git-friendly, but note this line is not removed during refactoring. see test_cutlass_moe_8_bit_no_graph
Signed-off-by: Ming Yang <yming@meta.com>
Purpose
#19667 changed the workspace creation from torch.zeros to torch.empty. This ended up causing correctness issues for models using cutlass_moe, e.g. Maverick in our test case. This PR fixes the correctness issue by explicitly filling zeros in cutlass_moe.

Test Plan
Test Result
lm_eval results:
local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8081/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1
Unit test stability verified:
- Without c1.fill_(0), a one-liner repeat run verifies stable failure.
- With c1.fill_(0), the same run verifies stable success.
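A repeat-runner of roughly this shape is one way to perform such a stability check; the test node id below is hypothetical, not necessarily the exact path in the repo.

import pytest

# Hypothetical stability check: run the new test repeatedly, stop on the
# first failure. The node id is illustrative.
for i in range(10):
    rc = pytest.main(
        ["-x", "tests/kernels/moe/test_cutlass_moe.py::test_run_cutlass_moe_fp8"])
    assert rc == 0, f"failed on iteration {i}"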