Conversation

@Jokeren Jokeren commented Sep 26, 2025

At context length 32768.

Before:

   ├─ 8187.845 8000.000 reduce_scatter
   │  ├─ 1157.875 100.000 _Z22ncclDevKernel_SendRecv24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 1.075 100.000 _Z31ncclDevKernel_AllGather_RING_LL24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 2.881 800.000 _ZN14at_cuda_detail3cub18DeviceReduceKernelINS0_6detail6reduce10policy_hubIiyN4cuda3std3__44plusIvEEE10Policy1000ENS0_22TransformInputIteratorIbN2at6native43_GLOBAL__N__e625f313_10_Nonzero_cu_cba1aaa09NonZeroOpIbEEPKblEEyS9_iNS7_10__identityEEEvT0_PT3_T1_NS0_13GridEvenShareISP_EET2_T4_
   │  ├─ 1.732 800.000 _ZN14at_cuda_detail3cub23DeviceCompactInitKernelINS0_13ScanTileStateIiLb1EEEPiEEvT_iT0_
   │  ├─ 6.269 800.000 _ZN14at_cuda_detail3cub23DeviceSelectSweepKernelINS0_6detail6select10policy_hubIlbiLb0ELb0ELb0EE10Policy1000ENS0_21CountingInputIteratorIllEENS0_22TransformInputIteratorIbN2at6native43_GLOBAL__N__e625f313_10_Nonzero_cu_cba1aaa09NonZeroOpIbEEPKblEEPlPiNS0_13ScanTileStateIiLb1EEENS0_8NullTypeESM_iNS3_19streaming_context_tIlLb1EEELb0ELb0EEEvT0_T1_T2_T3_T4_T5_T6_T7_iT8_NS2_7vsmem_tE
   │  ├─ 2.274 800.000 _ZN14at_cuda_detail3cub28DeviceReduceSingleTileKernelINS0_6detail6reduce10policy_hubIiyN4cuda3std3__44plusIvEEE10Policy1000EPiSC_iS9_iiNS7_10__identityEEEvT0_T1_T2_T3_T4_T6_
   │  ├─ 5.532 400.000 _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIbNS0_14func_wrapper_tIbZZZNS0_14or_kernel_cudaERNS_14TensorIteratorEENKUlvE_clEvENKUlvE10_clEvEUlbbE_EEjbLi4ELi4EEEEEvT1_
   │  ├─ 0.307 100.000 _ZN2at6native18elementwise_kernelILi128ELi2EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUllE_EEvS4_RKT_EUliE_EEviT1_
   │  ├─ 3470.640 800.000 _ZN2at6native24index_elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_21index_put_kernel_implINS0_10OpaqueTypeILi2EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESA_EUlPcPKclE_EEvRNS_18TensorIteratorBaseESA_SA_RKT_bEUliE_EEvlT1_
   │  ├─ 536.191 800.000 _ZN2at6native24vectorized_gather_kernelILi16ElEEvPcS2_PT0_illllb
   │  ├─ 2178.433 800.000 _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_St5arrayIPcLm2EELi4E23TrivialOffsetCalculatorILi1EjESD_NS0_6memory12LoadWithCastILi1EEENSE_13StoreWithCastILi1EEEEEviT_T0_T2_T3_T4_T5_
   │  ├─ 1.654 400.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_13AUnaryFunctorIiibNS0_51_GLOBAL__N__28ce311f_18_CompareEQKernel_cu_d8008c9616CompareEqFunctorIiEEEESt5arrayIPcLm2EEEEviT0_T1_
   │  ├─ 197.111 200.000 _ZN2at6native29vectorized_elementwise_kernelILi4EZNS0_23float8_copy_kernel_cudaERNS_18TensorIteratorBaseEEUlN3c104HalfEE_St5arrayIPcLm2EEEEviT0_T1_
   │  ├─ 87.883 200.000 _ZN2at6native29vectorized_elementwise_kernelILi8ENS0_11FillFunctorIN3c104HalfEEESt5arrayIPcLm1EEEEviT0_T1_
   │  ├─ 537.716 800.000 _ZN2at6native29vectorized_elementwise_kernelILi8ENS0_15CUDAFunctor_addIN3c104HalfEEESt5arrayIPcLm3EEEEviT0_T1_
   │  └─ 0.272 100.000 _ZN2at6native40_GLOBAL__N__0f1a8107_8_Shape_cu_49f7391c35CatArrayBatchedCopy_alignedK_contigINS1_10OpaqueTypeILj8EEEjLi2ELi128ELi1ELi16EEEvPT_NS1_25CatArrInputTensorMetadataIS5_T0_XT2_EXT3_EEENS1_16TensorSizeStrideIS8_Lj4EEEiS8_

After:

   ├─ 1398.861 600.000 reduce_scatter
   │  ├─ 1160.684 100.000 _Z22ncclDevKernel_SendRecv24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 43.490 100.000 _Z31ncclDevKernel_AllGather_RING_LL24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 0.364 100.000 _ZN2at6native18elementwise_kernelILi128ELi2EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUllE_EEvS4_RKT_EUliE_EEviT1_
   │  ├─ 0.249 100.000 _ZN2at6native40_GLOBAL__N__0f1a8107_8_Shape_cu_49f7391c35CatArrayBatchedCopy_alignedK_contigINS1_10OpaqueTypeILj8EEEjLi2ELi128ELi1ELi16EEEvPT_NS1_25CatArrInputTensorMetadataIS5_T0_XT2_EXT3_EEENS1_16TensorSizeStrideIS8_Lj4EEEiS8_
   │  ├─ 121.925 100.000 _accumulate_ep_triton_kernel [n_tokens=262144, hidden_size=5760, TP=2, EP=2]
   │  └─ 72.149 100.000 _prepare_ep_positions_kernel

The reduction step is no longer the bottleneck; communication is.

@Jokeren Jokeren marked this pull request as draft September 26, 2025 13:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 183 to 193
    if position != -1:
        row_offsets = position * hidden_size
        col_offsets = row_offsets + offs_n

        for tp_idx in tl.static_range(TP):
            values = tl.load(
                input_ptrs[tl.constexpr(ep_idx * TP + tp_idx)] + col_offsets,
                mask=io_mask,
                other=0,
            ).to(intermediate_dtype)
            output += values


[P0] Use Triton mask instead of Python branch

The new Triton accumulation kernel attempts to gate loads with if position != -1:. position is a runtime tl.int32 tensor, so evaluating it in a Python if will raise TypeError: bool value of Tensor is ambiguous during JIT compilation. As a result the Triton kernel never compiles and the CommKernelType.TRITON path fails immediately. Use a tensor mask (tl.where/tl.load with masks) instead of Python control flow so the kernel can compile.

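For reference, the mask-based alternative the review suggests can be illustrated outside Triton. This is a minimal NumPy analogue (the function name, shapes, and flat-buffer layout are hypothetical, not from the PR): the `position != -1` condition is folded into the load mask instead of a Python branch, so every program instance runs the same straight-line code.

```python
import numpy as np

def accumulate_masked(position, hidden_size, inputs, block_size_n):
    # Fold the validity check into the mask rather than branching on it.
    offs_n = np.arange(block_size_n)
    feature_mask = offs_n < hidden_size
    valid = position != -1                  # scalar condition
    io_mask = feature_mask & valid          # broadcast into the load mask
    # Clamp the offset to 0 when invalid so indexing stays in bounds;
    # the mask zeroes those lanes anyway.
    col_offsets = np.where(valid, position, 0) * hidden_size + offs_n
    output = np.zeros(block_size_n, dtype=np.float32)
    for buf in inputs:                      # one flat buffer per TP rank
        values = np.where(io_mask, buf[col_offsets], 0.0)
        output += values
    return output
```

In Triton the same effect comes from passing the combined mask to `tl.load(..., mask=..., other=0)`; whether that is needed here depends on whether `position` is a runtime tensor or, as the author replies below, a scalar.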

Contributor Author


@codex position is a scalar


Testing: ⚠️ Tests not run (not requested)


@Jokeren Jokeren changed the title [DRAFT][MULTI-GPU] Optimize reduce_scatter (except all-to-all) using custom triton kernels [MULTI-GPU] Optimize reduce_scatter (except all-to-all) using custom triton kernels Sep 27, 2025
@Jokeren Jokeren marked this pull request as ready for review September 27, 2025 00:44

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +232 to +235
    _accumulate_ep_triton_kernel[(n_tokens, )](
        positions,
        output_tensor,
        tuple(output_list),


[P0] Triton kernel called with unsupported tuple argument

The new _reduce_ep_triton path launches _accumulate_ep_triton_kernel and passes tuple(output_list) as a single kernel argument. Triton kernel parameters must be individual tensors or constexpr literals; a Python tuple is not a valid runtime value and the launch will raise a TypeError before any computation runs. As written, the TRITON communication path introduced in this commit cannot execute at all. Consider passing the tensors as separate arguments or materializing a device-side pointer array and indexing that inside the kernel.

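Assuming, as the review states, that a Python tuple cannot cross the kernel-launch boundary, the "materialize the buffers into one indexable object" workaround can be sketched in NumPy (function name and shapes are hypothetical):

```python
import numpy as np

def reduce_ep(output_list, n_tokens, hidden_size):
    # Stack the per-rank buffers into one contiguous array of shape
    # (world_size, n_tokens, hidden_size) before the launch; inside a kernel
    # rank i's slice would live at base_ptr + i * n_tokens * hidden_size.
    stacked = np.stack([b.reshape(n_tokens, hidden_size) for b in output_list])
    return stacked.sum(axis=0)   # accumulate across ranks
```

The alternative the review mentions, passing each tensor as its own kernel argument, avoids the extra copy but fixes the world size at compile time.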

Comment on lines 170 to 175
    offs_m = tl.program_id(0)
    token_mask = offs_m < n_tokens
    offs_n = tl.arange(0, BLOCK_SIZE_N)
    feature_mask = offs_n < hidden_size
    io_mask = token_mask[:] & feature_mask
    output = tl.zeros((BLOCK_SIZE_N, ), dtype=intermediate_dtype)


[P0] Scalar mask indexed as vector in Triton kernel

Inside _accumulate_ep_triton_kernel the mask is computed as io_mask = token_mask[:] & feature_mask. token_mask is a scalar (offs_m < n_tokens) so subscripting it with [:] is invalid in Triton and causes compilation to fail before the kernel can launch. This prevents the custom reduce‑scatter kernel from building. A scalar mask can be broadcast directly with token_mask & feature_mask or by constructing a vector mask of the appropriate shape.

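The broadcast the review proposes needs no subscript at all; a NumPy analogue (constants chosen arbitrarily, names mirroring the kernel's) shows a scalar mask combining directly with a vector mask:

```python
import numpy as np

BLOCK_SIZE_N, n_tokens, hidden_size = 8, 4, 5

def make_io_mask(offs_m):
    token_mask = offs_m < n_tokens       # scalar bool, like the Triton scalar
    offs_n = np.arange(BLOCK_SIZE_N)
    feature_mask = offs_n < hidden_size  # vector mask over the feature block
    return token_mask & feature_mask     # scalar & vector broadcasts; no `[:]`
```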
