Fix: Add fence.proxy.async before TMA load in combine recv phase by elvircrn · Pull Request #632 · deepseek-ai/DeepEP

elvircrn · 2026-05-11T08:44:32Z

Background

Repeated runs of combine at batch sizes above 256 showed correctness issues, and an issue was reported.

An investigation showed that reading the buffer pointer with __ldg instead of TMA makes combine produce consistent results at batch size > 256 on NVL72. A plausible explanation is that switching the load to the generic proxy resolves the issue.

Further investigation revealed that the local rank polls through the generic proxy but then loads the data with TMA, which goes through the async proxy. Without fence.proxy.async between the poll and the load, TMA can see stale data.

From https://docs.nvidia.com/cuda/parallel-thread-execution/#async-proxy:

Accessing the same memory location across multiple proxies needs a cross-proxy fence. 
For the async proxy, fence.proxy.async should be used to synchronize memory between 
generic proxy and the async proxy.
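
As a rough illustration of the ordering described above, the sketch below polls a flag through the generic proxy, issues fence.proxy.async, and only then reaches the point where a TMA load would be issued. It is not the DeepEP kernel: the names ld_acquire_sys, fence_proxy_async, wait_then_fence, recv_flag and done are hypothetical, the actual TMA load is left as a comment, and the code assumes compilation for sm_90 or newer (for example nvcc -arch=sm_90).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: acquire-load of a flag through the generic proxy.
__device__ __forceinline__ int ld_acquire_sys(const int* ptr) {
    int value;
    asm volatile("ld.acquire.sys.global.s32 %0, [%1];"
                 : "=r"(value) : "l"(ptr) : "memory");
    return value;
}

// Hypothetical helper: cross-proxy fence between the generic proxy and the
// async proxy. Requires sm_90 or newer.
__device__ __forceinline__ void fence_proxy_async() {
    asm volatile("fence.proxy.async;" ::: "memory");
}

// Illustrative kernel: the thread that would issue the TMA load first waits
// for the flag, then fences, then proceeds.
__global__ void wait_then_fence(const int* recv_flag, int* done) {
    if (threadIdx.x == 0) {
        // Poll the arrival flag through the generic proxy.
        while (ld_acquire_sys(recv_flag) == 0) {
        }
        // Order the generic-proxy observation before async-proxy accesses.
        fence_proxy_async();
        // A TMA (cp.async.bulk) load of the received buffer would go here.
        *done = 1;
    }
}

int main() {
    int* flag = nullptr;
    int* done = nullptr;
    cudaMalloc(&flag, sizeof(int));
    cudaMalloc(&done, sizeof(int));

    // The flag is set before launch so the polling loop exits immediately.
    int one = 1, zero = 0, result = 0;
    cudaMemcpy(flag, &one, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(done, &zero, sizeof(int), cudaMemcpyHostToDevice);

    wait_then_fence<<<1, 32>>>(flag, done);
    cudaMemcpy(&result, done, sizeof(int), cudaMemcpyDeviceToHost);
    printf("done = %d\n", result);

    cudaFree(flag);
    cudaFree(done);
    return result == 1 ? 0 : 1;
}
```

In this sketch the fence is executed by the same thread that observes the flag and would issue the TMA load, which is the placement the PR title describes ("before TMA load in combine recv phase").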

With the fence in place, the test scripts written for this report show correct behavior.

Fence Performance Impact

low-latency benchmark (median of 3 rounds, 1000 iters each)
  T=1024  H=7168  E=256  topk=8  ranks=4
  worst-case rank (max across ranks per round)

  strategies:
    random      = topk over random scores, different experts per token (cross-rank RDMA)
    random-same = topk over random scores, same K experts for all tokens (cross-rank RDMA)
    local-rand  = random offsets mapped to local rank experts (no RDMA, varied experts)
    local-same  = every token picks the same K local experts (no RDMA, hotspot)
    remote-rand = random experts, each from a different remote rank (all cross-rank RDMA)

strategy     fmt       cleg_on  cleg_off   delta     pct
                        (us)      (us)     (us)
------------ ------   -------  --------  ------  ------
random       bf16      178.1     170.1    +8.0   +4.7%
random       fp8       178.2     170.3    +7.9   +4.6%
random-same  bf16      646.8     639.0    +7.7   +1.2%
random-same  fp8       647.8     640.2    +7.6   +1.2%
local-rand   bf16      177.3     172.5    +4.8   +2.8%
local-rand   fp8       181.4     177.2    +4.2   +2.4%
local-same   bf16      656.2     653.2    +3.0   +0.5%
local-same   fp8       657.8     655.1    +2.7   +0.4%
remote-rand  bf16      223.7     215.8    +8.0   +3.7%
remote-rand  fp8       226.0     217.8    +8.2   +3.8%

Benchmark script.

Commit d7d1cdc: Add fence.proxy.async before TMA load in combine recv phase. Fixes correctness for batch size > 256 on NVL72

elvircrn wants to merge 1 commit into deepseek-ai:main from elvircrn:correctness-fix
