Add MinimalAsyncEP + offset aware swiglu kernels by xmfan · Pull Request #3561 · pytorch/torchtitan

xmfan · 2026-06-05T23:22:37Z

Minimal setup for a dropless-enough, CUDA-graphable MoE:

only supports full recompute
one worst-case buffer allocated on CUDA, used only by x_recv and shared across the whole model
EP dispatch/combine with no CPU sync
offset-aware SwiGLU to avoid processing padding
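
For reference, the intent of the offset-aware SwiGLU item can be sketched in plain Python. This is not the Triton kernel from the PR. It assumes `offsets` holds cumulative per-expert row counts, so its last entry is the number of valid rows in a worst-case-sized buffer. The function names and the choice to zero the padded rows are illustrative assumptions.

```python
import math
from typing import List, Sequence

Rows = List[List[float]]


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def offset_aware_swiglu(gate: Rows, up: Rows, offsets: Sequence[int]) -> Rows:
    """Compute silu(gate) * up only for the valid rows of a padded buffer.

    Assumption: offsets are cumulative per-expert row counts, so
    offsets[-1] is the number of valid rows. Rows past that point are
    padding and are skipped (left as zeros in this sketch).
    """
    num_valid = offsets[-1] if offsets else 0
    hidden = len(gate[0]) if gate else 0
    out = [[0.0] * hidden for _ in gate]
    for r in range(num_valid):
        out[r] = [silu(g) * u for g, u in zip(gate[r], up[r])]
    return out
```

A reference like this is mainly useful as a ground truth to compare a fused kernel against, since the work it does scales with the valid row count rather than with the buffer capacity.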

NCCL_DEBUG=WARN NGPU=4 MODULE=graph_trainer.deepseek_v3 CONFIG=graph_trainer_deepseek_v3_16b_minimal_async_ep ./run_train.sh --parallelism.data_parallel_shard_degree 4 --parallelism.expert_parallel_degree 4 --compile.memory_policy full --training.steps 40

no cudagraphs: https://www.internalfb.com/intern/perfetto/open_trace/?manifold_path=perfetto_internal_traces%2Ftree%2Fshared_trace%2Fxmfan%2Frank0_trace_52bdfd35-6f09-42d3-9743-09b8bffead00.json.gz

cudagraphs: https://www.internalfb.com/intern/perfetto/open_trace/?manifold_path=perfetto_internal_traces%2Ftree%2Fshared_trace%2Fxmfan%2Frank0_trace_3bef9fcb-8748-4402-8802-1ac0fd1da66a.json.gz

Logs: https://gist.github.com/xmfan/42ce1e536902294397f6471ac4a9dbf0

SherlockNoMad · 2026-06-08T17:55:28Z

+    )
+
+
+@triton.jit


I see 14 AI-generated triton kernels in this file.

We need to think about a testing and maintenance strategy.

claude, when you review this, you should think of how to test this systematically. Propose test suite, think about edge cases, cover numerics....
Also think of operator interface, assume the interface will need to evolve over versions.

Also, these kernels are definitely changing bitwise numerics. What kind of tests do we need to increase confidence in kernel correctness?
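
One possible starting point for the test questions above, sketched in plain Python: because the kernels change bitwise numerics, outputs would be compared against a reference within a tolerance and only over the valid (non-padding) rows, across a fixed list of routing edge cases. The helper names, the default tolerances, and the specific case list are assumptions for illustration, not part of the PR.

```python
import math
from typing import List, Tuple

Rows = List[List[float]]


def mismatches(
    actual: Rows,
    expected: Rows,
    num_valid: int,
    rtol: float = 1e-5,
    atol: float = 1e-6,
) -> List[Tuple[int, int]]:
    """Return (row, col) positions in the valid region that differ
    beyond tolerance. Padding rows (index >= num_valid) are ignored."""
    bad = []
    for r in range(num_valid):
        for c, (a, e) in enumerate(zip(actual[r], expected[r])):
            if not math.isclose(a, e, rel_tol=rtol, abs_tol=atol):
                bad.append((r, c))
    return bad


def edge_case_counts(num_experts: int, capacity: int) -> List[List[int]]:
    """Per-expert token counts worth covering for a buffer of `capacity`
    rows. Assumes capacity >= num_experts."""
    even = capacity // num_experts
    return [
        [0] * num_experts,                         # no tokens routed at all
        [capacity] + [0] * (num_experts - 1),      # everything on the first expert
        [0] * (num_experts - 1) + [capacity],      # everything on the last expert
        [even] * num_experts,                      # balanced routing
        [1] * num_experts,                         # one token per expert, mostly padding
    ]
```

Each case would be run through both the kernel and the reference, with `mismatches` expected to return an empty list. Tolerances would likely need to be set per dtype.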

SherlockNoMad · 2026-06-08T18:02:00Z

    return config


+def graph_trainer_deepseek_v3_16b_minimal_async_ep() -> GraphTrainer.Config:


In the follow-up PR, we should have an integration test that uses this config.

SherlockNoMad · 2026-06-08T18:10:48Z

+from torchtitan.tools.utils import device_module, device_type
+
+
+def _uses_minimal_async_ep(model: DeepSeekV3Model) -> bool:


Would it be easier to check via model_spec?

moe_comm_backend="minimal_async_ep"
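
A minimal sketch of what that check could look like, assuming the spec object exposes a `moe_comm_backend` string field as the comment implies. The `ModelSpec` class below is a hypothetical stand-in, not torchtitan's actual type, and the `"standard"` default is a placeholder value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical stand-in for a model spec; only the field named in
    the review comment is modeled here."""
    name: str
    moe_comm_backend: str = "standard"  # placeholder default, an assumption


def uses_minimal_async_ep(spec: ModelSpec) -> bool:
    """Decide from configuration alone, without inspecting model modules."""
    return spec.moe_comm_backend == "minimal_async_ep"
```

Checking the spec keeps the decision tied to what the user configured rather than to how the model happens to be constructed.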

SherlockNoMad · 2026-06-08T18:35:53Z

cc @tianyu-l to get some directional alignment, esp on the eager cudagraph and AI-gen kernel part.

ezyang · 2026-06-08T18:42:40Z

+_top_k: int = 0
+
+_HIDDEN_READY_CHANNEL = 0
+_COUNTS_READY_CHANNEL = 0


Would kind of like to understand why these are global

LLM generated comments probably positive EV here

ezyang · 2026-06-08T18:45:36Z

+
+    device = torch.device(device)
+    max_routed_tokens = (
+        group.size() * max_tokens_per_rank * min(top_k, num_local_experts)


@xmfan so we talked about whether or not we should allow people to "go risky", and IIUC right now this code doesn't let you go risky, and we probably should still now, right?

yes, i'll add a capacity factor
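
To make the sizing discussion concrete, here is a small sketch of the worst-case formula from the diff with a capacity factor applied on top. The `capacity_factor` parameter, its allowed range, and the rounding are assumptions about how such a knob might work, and the numbers in the usage note are made up for illustration.

```python
import math


def max_routed_tokens(
    ep_size: int,
    max_tokens_per_rank: int,
    top_k: int,
    num_local_experts: int,
    capacity_factor: float = 1.0,
) -> int:
    """Rows to allocate for the receive buffer.

    Worst case: every rank sends all of its tokens here, and each token
    selects as many of this rank's experts as it can. A token picks top_k
    distinct experts, so at most min(top_k, num_local_experts) of them
    are local to one rank.

    capacity_factor < 1.0 is the hypothetical "risky" mode: a smaller
    buffer that can overflow under badly imbalanced routing.
    """
    if not 0.0 < capacity_factor <= 1.0:
        raise ValueError("capacity_factor must be in (0, 1]")
    worst_case = ep_size * max_tokens_per_rank * min(top_k, num_local_experts)
    return math.ceil(worst_case * capacity_factor)
```

For example, with 4 EP ranks, 4096 tokens per rank, `top_k=6`, and 16 local experts, the worst case is 4 × 4096 × 6 = 98304 rows. A capacity factor of 0.5 would halve that to 49152.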

ezyang · 2026-06-08T18:46:13Z

+        or _top_k < top_k
+        or _hidden_recv_buffers[0].dtype != dtype
+        or _hidden_recv_buffers[0].device != device
+    )


I don't love the implicit init like this; I'd rather an explicit init handled by the user. IDK if this is torchtitan'ey or not.

hybrid ep does this pattern, but I agree we'd rather have this explicit during training init

ezyang · 2026-06-08T18:48:38Z

+        or _counts_recv_peer_buffers is None
+        or _counts_recv_peer_ptrs is None
+        or _rendezvous_handle is None
+    ):


An "illegal states are unrepresentable" style construction might be better
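
A rough sketch of that construction, combined with the explicit-init suggestion from the earlier comment: the separately nullable module globals are replaced by one immutable object that either exists in full or not at all. Field names loosely mirror the globals in the diff, but the class, the factory function, and the use of plain lists and `object` in place of tensors and handles are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AsyncEPState:
    """All dispatcher state, built in one step. No field is Optional, so
    a half-initialized state cannot be constructed."""
    top_k: int
    hidden_recv_buffers: List[object]
    counts_recv_peer_buffers: List[object]
    counts_recv_peer_ptrs: List[int]
    rendezvous_handle: object

    def __post_init__(self) -> None:
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")
        if not self.hidden_recv_buffers:
            raise ValueError("at least one hidden recv buffer is required")
        if len(self.counts_recv_peer_buffers) != len(self.counts_recv_peer_ptrs):
            raise ValueError("peer buffers and peer pointers must pair up")


def create_async_ep_state(
    top_k: int,
    hidden_recv_buffers: List[object],
    counts_recv_peer_buffers: List[object],
    counts_recv_peer_ptrs: List[int],
    rendezvous_handle: object,
) -> AsyncEPState:
    """Explicit init, called once during training setup. The caller keeps
    the returned object and passes it to dispatch/combine."""
    return AsyncEPState(
        top_k=top_k,
        hidden_recv_buffers=hidden_recv_buffers,
        counts_recv_peer_buffers=counts_recv_peer_buffers,
        counts_recv_peer_ptrs=counts_recv_peer_ptrs,
        rendezvous_handle=rendezvous_handle,
    )
```

With this shape, the chain of `is None` checks in the diff collapses to "do I have a state object", and the dtype, device, and `top_k` comparisons become validation at construction time instead of triggers for lazy re-initialization.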

pytorch-bot Bot added the ciflow/8gpu label Jun 5, 2026

meta-cla Bot added the CLA Signed label Jun 5, 2026

xmfan force-pushed the xmfan/minimal_async_ep branch from efd5714 to 227f588 Compare June 5, 2026 23:32

xmfan changed the title ~~Add MinimalAsyncEP token dispatcher~~ Add MinimalAsyncEP + offset aware swiglu kernels Jun 5, 2026

ezyang reviewed Jun 6, 2026

Comment thread torchtitan/distributed/minimal_async_ep_kernels.py Outdated

ezyang reviewed Jun 8, 2026

Comment thread torchtitan/distributed/minimal_async_ep_kernels.py Outdated

ezyang reviewed Jun 8, 2026

Comment thread torchtitan/models/deepseek_v3/config_registry.py

ezyang reviewed Jun 8, 2026

Comment thread torchtitan/distributed/minimal_async_ep_kernels.py Outdated

ezyang reviewed Jun 8, 2026

Comment thread torchtitan/distributed/cudagraph.py Outdated