
Fix tensor shapes for DeepEP and DeepGEMM assertions #19546

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 changes: 2 additions & 0 deletions vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py
@@ -120,6 +120,8 @@ def apply(
a2q = a2q.view(E, max_num_tokens, -1)
a2q_scale = a2q_scale.view(E, max_num_tokens, -1)

output = output.view(E, max_num_tokens, -1)
Contributor (severity: medium):
Consider adding a comment here to explain why this reshape is necessary. For example, clarifying that this view aligns the output tensor with the shape expected by the dg.m_grouped_gemm_fp8_fp8_bf16_nt_masked kernel would improve code clarity for future readers.
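
For illustration, a minimal self-contained sketch of the shape bookkeeping such a comment would describe (the sizes below are made up, and the DeepGEMM call itself is omitted):

import torch

# Hypothetical sizes, for illustration only.
E, max_num_tokens, hidden = 4, 8, 16

# Assume the output buffer arrives with a flattened leading dimension.
output = torch.empty(E * max_num_tokens, hidden, dtype=torch.bfloat16)

# View it as (num_experts, max_num_tokens, hidden) so it lines up with the
# per-expert masked layout that dg.m_grouped_gemm_fp8_fp8_bf16_nt_masked
# writes into.
output = output.view(E, max_num_tokens, -1)
assert output.shape == (E, max_num_tokens, hidden)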

Contributor:
If the output shape is incorrect, it should be fixed in workspace_shapes. I think @varun-sundar-rabindranath already has some fixes for this in another PR.

Contributor:
Thanks @ptarasiewiczNV - this is the fix I have: #19515. In addition to resolving this, it fixes a few other issues as well. Can you please take a look and see if it works for you? Much appreciated 🙌 Thanks.

Contributor (Author):
Thanks, this looks great. I will be able to test it tomorrow, but this is definitely the proper direction, so I will close this PR.


dg.m_grouped_gemm_fp8_fp8_bf16_nt_masked((a2q, a2q_scale),
(w2, w2_scale),
out=output,
@@ -174,6 +174,11 @@ def finalize(self, output: torch.Tensor, fused_expert_output: torch.Tensor,
# weights have already been applied.
combine_topk_weights = torch.ones_like(topk_weights)

_, _, num_max_dispatch_tokens_per_rank, _, num_experts = self.handle
fused_expert_output = fused_expert_output.view(
num_experts // self.buffer.group_size,
self.buffer.group_size * num_max_dispatch_tokens_per_rank, -1)
Comment on lines +177 to +180
Contributor (severity: medium):
It would be helpful to add a comment explaining this reshape. For instance, mentioning that this view operation transforms fused_expert_output to match the input shape requirements of the self.buffer.low_latency_combine kernel, possibly referencing the expected dimensions (e.g., (num_experts / group_size, group_size * num_max_dispatch_tokens_per_rank, hidden_size)), would enhance readability and maintainability.
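
For illustration, a minimal self-contained sketch of the regrouping this view performs, per the comment above (the sizes are hypothetical, and the actual combine call is omitted):

import torch

# Hypothetical sizes, for illustration only.
num_experts, group_size, num_max_dispatch_tokens_per_rank, hidden = 8, 2, 4, 16

# Assume the expert output arrives as
# (num_experts, num_max_dispatch_tokens_per_rank, hidden).
fused_expert_output = torch.randn(
    num_experts, num_max_dispatch_tokens_per_rank, hidden)

# Regroup experts into the layout low_latency_combine expects:
# (num_experts / group_size, group_size * num_max_dispatch_tokens_per_rank, hidden).
fused_expert_output = fused_expert_output.view(
    num_experts // group_size,
    group_size * num_max_dispatch_tokens_per_rank, -1)

assert fused_expert_output.shape == (
    num_experts // group_size,
    group_size * num_max_dispatch_tokens_per_rank,
    hidden,
)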


# TODO (varun) : Enable zero copy mode
_, event, hook = self.buffer.low_latency_combine(
fused_expert_output,