Fix: Add fence.proxy.async before TMA load in combine recv phase #632

Open. elvircrn wants to merge 1 commit into deepseek-ai:main from elvircrn:correctness-fix


elvircrn commented May 11, 2026

Background

Repeated runs of combine at batch sizes above 256 showed correctness issues, and an issue was reported.

An investigation showed that reading the buffer pointer with __ldg instead of TMA makes combine produce consistent results at batch size > 256 on NVL72. A plausible explanation is that switching to the generic proxy resolves the issue.

Further investigation revealed that the local rank polls through the generic proxy but then loads the data with TMA, which goes through the async proxy. Without fence.proxy.async, TMA can see stale data.

From https://docs.nvidia.com/cuda/parallel-thread-execution/#async-proxy:

Accessing the same memory location across multiple proxies needs a cross-proxy fence. 
For the async proxy, fence.proxy.async should be used to synchronize memory between 
generic proxy and the async proxy.

With the fence in place, the test scripts written for this report show correct behavior.
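The ordering described above can be sketched as a standalone kernel. This is a minimal illustration, not the DeepEP code: the names `recv_kernel`, `ld_acquire_sys`, `fence_proxy_async`, `tma_load_1d`, the `mbarrier_*` helpers and `kNumBytes` are hypothetical, the host sets the flag before launch instead of a remote rank, and the plain `fence.proxy.async` variant is an assumption (the actual change may use a state-space-qualified form). It assumes a GPU with compute capability 9.0 or newer, built with something like `nvcc -arch=sm_90`.

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

constexpr uint32_t kNumBytes = 1024;  // bulk copies need a multiple of 16 bytes

// Poll helper: acquire load through the generic proxy.
__device__ __forceinline__ int ld_acquire_sys(const int* ptr) {
    int value;
    asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(value) : "l"(ptr) : "memory");
    return value;
}

// Cross-proxy fence between the generic proxy and the async proxy.
__device__ __forceinline__ void fence_proxy_async() {
    asm volatile("fence.proxy.async;" ::: "memory");
}

__device__ __forceinline__ void mbarrier_init(uint64_t* mbar, uint32_t count) {
    auto addr = static_cast<uint32_t>(__cvta_generic_to_shared(mbar));
    asm volatile("mbarrier.init.shared::cta.b64 [%0], %1;" :: "r"(addr), "r"(count) : "memory");
}

__device__ __forceinline__ void mbarrier_arrive_expect_tx(uint64_t* mbar, uint32_t bytes) {
    auto addr = static_cast<uint32_t>(__cvta_generic_to_shared(mbar));
    asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;" :: "r"(addr), "r"(bytes) : "memory");
}

// Spin until the barrier phase with the given parity has completed.
__device__ __forceinline__ void mbarrier_wait(uint64_t* mbar, uint32_t parity) {
    auto addr = static_cast<uint32_t>(__cvta_generic_to_shared(mbar));
    uint32_t done = 0;
    while (!done) {
        asm volatile("{\n"
                     ".reg .pred p;\n"
                     "mbarrier.try_wait.parity.shared::cta.b64 p, [%1], %2;\n"
                     "selp.u32 %0, 1, 0, p;\n"
                     "}\n" : "=r"(done) : "r"(addr), "r"(parity) : "memory");
    }
}

// 1D TMA bulk load from global to shared memory (async proxy).
__device__ __forceinline__ void tma_load_1d(void* smem, const void* gmem, uint64_t* mbar, uint32_t bytes) {
    auto dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem));
    auto bar = static_cast<uint32_t>(__cvta_generic_to_shared(mbar));
    asm volatile("cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];"
                 :: "r"(dst), "l"(gmem), "r"(bytes), "r"(bar) : "memory");
}

__global__ void recv_kernel(const int* flag, const uint8_t* buffer, uint8_t* out) {
    __shared__ alignas(16) uint8_t smem[kNumBytes];
    __shared__ alignas(8) uint64_t mbar;
    if (threadIdx.x == 0) {
        mbarrier_init(&mbar, 1);
        fence_proxy_async();                    // make the barrier init visible to the async proxy
        while (ld_acquire_sys(flag) == 0) {}    // 1. poll the arrival flag (generic proxy)
        fence_proxy_async();                    // 2. order the poll before the async-proxy read
        mbarrier_arrive_expect_tx(&mbar, kNumBytes);
        tma_load_1d(smem, buffer, &mbar, kNumBytes);  // 3. TMA load (async proxy)
    }
    __syncthreads();
    mbarrier_wait(&mbar, 0);                    // every thread waits for the TMA bytes to land
    for (uint32_t i = threadIdx.x; i < kNumBytes; i += blockDim.x) out[i] = smem[i];
}

int main() {
    uint8_t host_in[kNumBytes], host_out[kNumBytes];
    for (uint32_t i = 0; i < kNumBytes; ++i) host_in[i] = static_cast<uint8_t>(i % 251);
    int* flag; uint8_t *buffer, *out;
    cudaMalloc(&flag, sizeof(int));
    cudaMalloc(&buffer, kNumBytes);
    cudaMalloc(&out, kNumBytes);
    cudaMemcpy(buffer, host_in, kNumBytes, cudaMemcpyHostToDevice);
    const int ready = 1;  // stand-in for the flag a remote rank would write
    cudaMemcpy(flag, &ready, sizeof(int), cudaMemcpyHostToDevice);
    recv_kernel<<<1, 128>>>(flag, buffer, out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); return 1; }
    cudaMemcpy(host_out, out, kNumBytes, cudaMemcpyDeviceToHost);
    printf("%s\n", std::memcmp(host_in, host_out, kNumBytes) == 0 ? "match" : "MISMATCH");
    cudaFree(flag); cudaFree(buffer); cudaFree(out);
    return 0;
}
```

Because the host writes both the data and the flag before launch, this sketch cannot reproduce the stale-data race itself. It only shows where the fence sits relative to the poll and the TMA load.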

Fence Performance Impact

low-latency benchmark (median of 3 rounds, 1000 iters each)
  T=1024  H=7168  E=256  topk=8  ranks=4
  worst-case rank (max across ranks per round)

  strategies:
    random      = topk over random scores, different experts per token (cross-rank RDMA)
    random-same = topk over random scores, same K experts for all tokens (cross-rank RDMA)
    local-rand  = random offsets mapped to local rank experts (no RDMA, varied experts)
    local-same  = every token picks the same K local experts (no RDMA, hotspot)
    remote-rand = random experts, each from a different remote rank (all cross-rank RDMA)

strategy     fmt       cleg_on  cleg_off   delta     pct
                        (us)      (us)     (us)
------------ ------   -------  --------  ------  ------
random       bf16      178.1     170.1    +8.0   +4.7%
random       fp8       178.2     170.3    +7.9   +4.6%
random-same  bf16      646.8     639.0    +7.7   +1.2%
random-same  fp8       647.8     640.2    +7.6   +1.2%
local-rand   bf16      177.3     172.5    +4.8   +2.8%
local-rand   fp8       181.4     177.2    +4.2   +2.4%
local-same   bf16      656.2     653.2    +3.0   +0.5%
local-same   fp8       657.8     655.1    +2.7   +0.4%
remote-rand  bf16      223.7     215.8    +8.0   +3.7%
remote-rand  fp8       226.0     217.8    +8.2   +3.8%

Benchmark script.
