Conversation

@Jokeren Jokeren commented Sep 26, 2025

At context length 32768.

Before:

   ├─ 8187.845 8000.000 reduce_scatter
   │  ├─ 1157.875 100.000 _Z22ncclDevKernel_SendRecv24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 1.075 100.000 _Z31ncclDevKernel_AllGather_RING_LL24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 2.881 800.000 _ZN14at_cuda_detail3cub18DeviceReduceKernelINS0_6detail6reduce10policy_hubIiyN4cuda3std3__44plusIvEEE10Policy1000ENS0_22TransformInputIteratorIbN2at6native43_GLOBAL__N__e625f313_10_Nonzero_cu_cba1aaa09NonZeroOpIbEEPKblEEyS9_iNS7_10__identityEEEvT0_PT3_T1_NS0_13GridEvenShareISP_EET2_T4_
   │  ├─ 1.732 800.000 _ZN14at_cuda_detail3cub23DeviceCompactInitKernelINS0_13ScanTileStateIiLb1EEEPiEEvT_iT0_
   │  ├─ 6.269 800.000 _ZN14at_cuda_detail3cub23DeviceSelectSweepKernelINS0_6detail6select10policy_hubIlbiLb0ELb0ELb0EE10Policy1000ENS0_21CountingInputIteratorIllEENS0_22TransformInputIteratorIbN2at6native43_GLOBAL__N__e625f313_10_Nonzero_cu_cba1aaa09NonZeroOpIbEEPKblEEPlPiNS0_13ScanTileStateIiLb1EEENS0_8NullTypeESM_iNS3_19streaming_context_tIlLb1EEELb0ELb0EEEvT0_T1_T2_T3_T4_T5_T6_T7_iT8_NS2_7vsmem_tE
   │  ├─ 2.274 800.000 _ZN14at_cuda_detail3cub28DeviceReduceSingleTileKernelINS0_6detail6reduce10policy_hubIiyN4cuda3std3__44plusIvEEE10Policy1000EPiSC_iS9_iiNS7_10__identityEEEvT0_T1_T2_T3_T4_T6_
   │  ├─ 5.532 400.000 _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIbNS0_14func_wrapper_tIbZZZNS0_14or_kernel_cudaERNS_14TensorIteratorEENKUlvE_clEvENKUlvE10_clEvEUlbbE_EEjbLi4ELi4EEEEEvT1_
   │  ├─ 0.307 100.000 _ZN2at6native18elementwise_kernelILi128ELi2EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUllE_EEvS4_RKT_EUliE_EEviT1_
   │  ├─ 3470.640 800.000 _ZN2at6native24index_elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_21index_put_kernel_implINS0_10OpaqueTypeILi2EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESA_EUlPcPKclE_EEvRNS_18TensorIteratorBaseESA_SA_RKT_bEUliE_EEvlT1_
   │  ├─ 536.191 800.000 _ZN2at6native24vectorized_gather_kernelILi16ElEEvPcS2_PT0_illllb
   │  ├─ 2178.433 800.000 _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_St5arrayIPcLm2EELi4E23TrivialOffsetCalculatorILi1EjESD_NS0_6memory12LoadWithCastILi1EEENSE_13StoreWithCastILi1EEEEEviT_T0_T2_T3_T4_T5_
   │  ├─ 1.654 400.000 _ZN2at6native29vectorized_elementwise_kernelILi4ENS0_13AUnaryFunctorIiibNS0_51_GLOBAL__N__28ce311f_18_CompareEQKernel_cu_d8008c9616CompareEqFunctorIiEEEESt5arrayIPcLm2EEEEviT0_T1_
   │  ├─ 197.111 200.000 _ZN2at6native29vectorized_elementwise_kernelILi4EZNS0_23float8_copy_kernel_cudaERNS_18TensorIteratorBaseEEUlN3c104HalfEE_St5arrayIPcLm2EEEEviT0_T1_
   │  ├─ 87.883 200.000 _ZN2at6native29vectorized_elementwise_kernelILi8ENS0_11FillFunctorIN3c104HalfEEESt5arrayIPcLm1EEEEviT0_T1_
   │  ├─ 537.716 800.000 _ZN2at6native29vectorized_elementwise_kernelILi8ENS0_15CUDAFunctor_addIN3c104HalfEEESt5arrayIPcLm3EEEEviT0_T1_
   │  └─ 0.272 100.000 _ZN2at6native40_GLOBAL__N__0f1a8107_8_Shape_cu_49f7391c35CatArrayBatchedCopy_alignedK_contigINS1_10OpaqueTypeILj8EEEjLi2ELi128ELi1ELi16EEEvPT_NS1_25CatArrInputTensorMetadataIS5_T0_XT2_EXT3_EEENS1_16TensorSizeStrideIS8_Lj4EEEiS8_

After:

   ├─ 1398.861 600.000 reduce_scatter
   │  ├─ 1160.684 100.000 _Z22ncclDevKernel_SendRecv24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 43.490 100.000 _Z31ncclDevKernel_AllGather_RING_LL24ncclDevKernelArgsStorageILm4096EE
   │  ├─ 0.364 100.000 _ZN2at6native18elementwise_kernelILi128ELi2EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUllE_EEvS4_RKT_EUliE_EEviT1_
   │  ├─ 0.249 100.000 _ZN2at6native40_GLOBAL__N__0f1a8107_8_Shape_cu_49f7391c35CatArrayBatchedCopy_alignedK_contigINS1_10OpaqueTypeILj8EEEjLi2ELi128ELi1ELi16EEEvPT_NS1_25CatArrInputTensorMetadataIS5_T0_XT2_EXT3_EEENS1_16TensorSizeStrideIS8_Lj4EEEiS8_
   │  ├─ 121.925 100.000 _accumulate_ep_triton_kernel [n_tokens=262144, hidden_size=5760, TP=2, EP=2]
   │  └─ 72.149 100.000 _prepare_ep_positions_kernel

The reduction step is no longer the bottleneck; communication is.

@Jokeren Jokeren marked this pull request as draft September 26, 2025 13:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 183 to 193
    if position != -1:
        row_offsets = position * hidden_size
        col_offsets = row_offsets + offs_n

        for tp_idx in tl.static_range(TP):
            values = tl.load(
                input_ptrs[tl.constexpr(ep_idx * TP + tp_idx)] + col_offsets,
                mask=io_mask,
                other=0,
            ).to(intermediate_dtype)
            output += values


[P0] Use Triton mask instead of Python branch

The new Triton accumulation kernel attempts to gate loads with if position != -1:. position is a runtime tl.int32 tensor, so evaluating it in a Python if will raise TypeError: bool value of Tensor is ambiguous during JIT compilation. As a result the Triton kernel never compiles and the CommKernelType.TRITON path fails immediately. Use a tensor mask (tl.where/tl.load with masks) instead of Python control flow so the kernel can compile.

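For reference, the mask-based alternative the review suggests can be illustrated outside Triton. This is a minimal NumPy analogue (the function name, shapes, and flat-buffer layout are hypothetical, not from the PR): the `position != -1` condition is folded into the load mask instead of a Python branch, so every program instance runs the same straight-line code.

```python
import numpy as np

def accumulate_masked(position, hidden_size, inputs, block_size_n):
    # Fold the validity check into the mask rather than branching on it.
    offs_n = np.arange(block_size_n)
    feature_mask = offs_n < hidden_size
    valid = position != -1                  # scalar condition
    io_mask = feature_mask & valid          # broadcast into the load mask
    # Clamp the offset to 0 when invalid so indexing stays in bounds;
    # the mask zeroes those lanes anyway.
    col_offsets = np.where(valid, position, 0) * hidden_size + offs_n
    output = np.zeros(block_size_n, dtype=np.float32)
    for buf in inputs:                      # one flat buffer per TP rank
        values = np.where(io_mask, buf[col_offsets], 0.0)
        output += values
    return output
```

In Triton the same effect comes from passing the combined mask to `tl.load(..., mask=..., other=0)`; whether that is needed here depends on whether `position` is a runtime tensor or, as the author replies below, a scalar.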

Contributor Author


@codex position is a scalar


Testing: ⚠️ Tests not run (not requested)


@Jokeren Jokeren changed the title [DRAFT][MULTI-GPU] Optimize reduce_scatter (except all-to-all) using custom triton kernels [MULTI-GPU] Optimize reduce_scatter (except all-to-all) using custom triton kernels Sep 27, 2025
@Jokeren Jokeren marked this pull request as ready for review September 27, 2025 00:44

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +232 to +235
    _accumulate_ep_triton_kernel[(n_tokens, )](
        positions,
        output_tensor,
        tuple(output_list),


[P0] Triton kernel called with unsupported tuple argument

The new _reduce_ep_triton path launches _accumulate_ep_triton_kernel and passes tuple(output_list) as a single kernel argument. Triton kernel parameters must be individual tensors or constexpr literals; a Python tuple is not a valid runtime value and the launch will raise a TypeError before any computation runs. As written, the TRITON communication path introduced in this commit cannot execute at all. Consider passing the tensors as separate arguments or materializing a device-side pointer array and indexing that inside the kernel.

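Assuming, as the review states, that a Python tuple cannot cross the kernel-launch boundary, the "materialize the buffers into one indexable object" workaround can be sketched in NumPy (function name and shapes are hypothetical):

```python
import numpy as np

def reduce_ep(output_list, n_tokens, hidden_size):
    # Stack the per-rank buffers into one contiguous array of shape
    # (world_size, n_tokens, hidden_size) before the launch; inside a kernel
    # rank i's slice would live at base_ptr + i * n_tokens * hidden_size.
    stacked = np.stack([b.reshape(n_tokens, hidden_size) for b in output_list])
    return stacked.sum(axis=0)   # accumulate across ranks
```

The alternative the review mentions, passing each tensor as its own kernel argument, avoids the extra copy but fixes the world size at compile time.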

Comment on lines 170 to 175
    offs_m = tl.program_id(0)
    token_mask = offs_m < n_tokens
    offs_n = tl.arange(0, BLOCK_SIZE_N)
    feature_mask = offs_n < hidden_size
    io_mask = token_mask[:] & feature_mask
    output = tl.zeros((BLOCK_SIZE_N, ), dtype=intermediate_dtype)


[P0] Scalar mask indexed as vector in Triton kernel

Inside _accumulate_ep_triton_kernel the mask is computed as io_mask = token_mask[:] & feature_mask. token_mask is a scalar (offs_m < n_tokens) so subscripting it with [:] is invalid in Triton and causes compilation to fail before the kernel can launch. This prevents the custom reduce‑scatter kernel from building. A scalar mask can be broadcast directly with token_mask & feature_mask or by constructing a vector mask of the appropriate shape.

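The broadcast the review proposes needs no subscript at all; a NumPy analogue (constants chosen arbitrarily, names mirroring the kernel's) shows a scalar mask combining directly with a vector mask:

```python
import numpy as np

BLOCK_SIZE_N, n_tokens, hidden_size = 8, 4, 5

def make_io_mask(offs_m):
    token_mask = offs_m < n_tokens       # scalar bool, like the Triton scalar
    offs_n = np.arange(BLOCK_SIZE_N)
    feature_mask = offs_n < hidden_size  # vector mask over the feature block
    return token_mask & feature_mask     # scalar & vector broadcasts; no `[:]`
```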
