@yifeizhang-c
Contributor

Support CUDA Graph for the internode dispatch kernels, using the same logic already applied to the intranode dispatch kernels.

while (ld_volatile_global(moe_recv_rdma_counter_mapped) != -1);
*moe_recv_rdma_counter_mapped = sum;
if (num_worst_tokens == 0) {
while (ld_volatile_global(moe_recv_rdma_counter_mapped) != -1);
Contributor Author

@yifeizhang-c yifeizhang-c Sep 30, 2025

I'd like to double-check the design here. Is the while (ld_volatile_global(...)) loop there for cache coherency? That is, does the device side need to confirm that the host-side update of the value has already been written back before the device side makes its own update?
I'm asking because intranode dispatch does not have such logic.
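
To make the question concrete, here is a minimal host-only sketch of one possible reading of the handshake. It is an analogy, not the DeepEP code: a `std::thread` stands in for the kernel, a `std::atomic<int>` stands in for the host-mapped counter, and `fake_kernel`, `run_dispatch`, and `kNotReady` are hypothetical names. It also assumes that when `num_worst_tokens > 0` the host sizes its buffers for the worst case and never polls the counter, which is what would let the kernel skip the wait and the write.

```cpp
// Host-only analogy of the counter handshake; NOT the DeepEP implementation.
// All names below are hypothetical and exist only for illustration.
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

constexpr int kNotReady = -1;  // sentinel the host writes before "launch"

// Stand-in for the device side: wait until the host's reset to -1 is
// visible, then publish the result. Skipped when num_worst_tokens > 0.
void fake_kernel(std::atomic<int>& counter, int sum, int num_worst_tokens) {
    if (num_worst_tokens == 0) {
        while (counter.load(std::memory_order_acquire) != kNotReady) {
            std::this_thread::yield();
        }
        counter.store(sum, std::memory_order_release);
    }
}

// Stand-in for the host side: reset the counter, "launch" the kernel, and,
// only in the non-graph path, spin until the kernel has published a value.
int run_dispatch(std::atomic<int>& counter, int sum, int num_worst_tokens) {
    counter.store(kNotReady, std::memory_order_release);
    std::thread kernel(fake_kernel, std::ref(counter), sum, num_worst_tokens);
    int received = num_worst_tokens;  // assumed graph path: no polling
    if (num_worst_tokens == 0) {
        while ((received = counter.load(std::memory_order_acquire)) == kNotReady) {
            std::this_thread::yield();
        }
    }
    kernel.join();
    return received;
}
```

Under this reading, the spin on `-1` guards against the kernel publishing its value before the host's reset has landed, in which case the reset would overwrite the result and the host would wait forever. Whether that is the actual intent of the loop is exactly what the question above asks.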

@yifeizhang-c force-pushed the enable-internode-cuda-graph branch 3 times, most recently from 610b076 to 6091f94, October 28, 2025 09:07
@yifeizhang-c
Contributor Author

@sphish Hi, could you please review this PR? Thanks!

@yifeizhang-c force-pushed the enable-internode-cuda-graph branch from e8ebcaf to 6ad4396, November 5, 2025 06:28
@yifeizhang-c force-pushed the enable-internode-cuda-graph branch from 6ad4396 to d5e6717, November 5, 2025 07:29
@sphish merged commit 92fe2de into deepseek-ai:main, Nov 5, 2025
1 check passed

2 participants