
[AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp #5503

Open
joviliast wants to merge 1 commit into main from atomic-lds-wip

Conversation

@joviliast (Contributor) commented Dec 26, 2024

Algorithm description (a simplified scalar model follows the list):

  1. Sort {ptr, operand} pairs among the threads within the warp
     via a bitonic sort based on DPP and permute operations;
  2. Distribute threads between groups defined by pointers;
     determine the group of each thread by analyzing its neighbours
     with DPP instructions;
  3. Select a master thread for each group using the exec mask;
  4. Collect a partial sum in LDS for each group via
     DS_ADD-like instructions;
  5. Issue a global atomic operation for each group from
     its master thread.
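
Below is a minimal single-threaded C++ model of these steps, assuming a 64-lane wavefront. It is illustrative only: the real lowering performs the sort with DPP/permute instructions, the neighbour comparison with DPP, master election with the exec mask, and the partial sums with DS_ADD-like LDS instructions, none of which this scalar sketch models.

```cpp
#include <algorithm>
#include <array>
#include <utility>

constexpr int kWarpSize = 64; // assumption: CDNA wavefront size

// Each lane of the wavefront holds one {ptr, operand} pair of an atomic add.
void wavefrontAtomicAdd(std::array<std::pair<float *, float>, kWarpSize> lanes) {
  // Step 1: sort the pairs by pointer (bitonic sort over DPP/permute on HW).
  std::sort(lanes.begin(), lanes.end(),
            [](const auto &a, const auto &b) { return a.first < b.first; });

  int lane = 0;
  while (lane < kWarpSize) {
    // Steps 2-3: a lane whose pointer differs from its left neighbour opens a
    // new group and acts as that group's master thread.
    float *ptr = lanes[lane].first;

    // Step 4: accumulate the group's partial sum (DS_ADD-like LDS ops on HW).
    float partial = 0.0f;
    while (lane < kWarpSize && lanes[lane].first == ptr)
      partial += lanes[lane++].second;

    // Step 5: only the master thread touches global memory, once per group.
    *ptr += partial; // stands in for a single global atomic add
  }
}
```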

Added a lit test checking the highlights of the algorithm.

Since the described algorithm requires additional memory, the size calculation
should be done in target-dependent code.
For this purpose the `SetSpecificAllocationSize` pass was introduced; it sets
the `allocation.size` attribute on the operations that require it. This attribute
takes the highest priority during LDS size calculation in the `Allocation` analysis.
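
A rough sketch of the annotation step is shown below; it is not the actual pass. The names `setSpecificAllocationSizes` and `computeLDSSizeInBytes` are hypothetical, matching ops by the `tt.atomic_rmw` name and storing the size as an `i64` attribute are assumptions here.

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"

// Hypothetical helper: target-dependent LDS requirement of the op, in bytes.
int64_t computeLDSSizeInBytes(mlir::Operation *op);

// Sketch: mark atomic RMW ops with their scratch size so the Allocation
// analysis can pick it up later via the allocation.size attribute.
void setSpecificAllocationSizes(mlir::ModuleOp mod) {
  mod.walk([](mlir::Operation *op) {
    if (op->getName().getStringRef() != "tt.atomic_rmw")
      return;
    mlir::OpBuilder b(op);
    op->setAttr("allocation.size",
                b.getI64IntegerAttr(computeLDSSizeInBytes(op)));
  });
}
```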

Added a lit test for `SetSpecificAllocationSize`.

Also extended `test_core.py::test_tensor_atomic_rmw` to cover the cases
of interest.

@joviliast (Contributor, Author)

TODO: provide testing;

Please consider https://github.com/triton-lang/triton/pull/5503/files#diff-c8636c7e4e8a1713c11e249836ccf2fe132ffc8fd85ad7054582ecad544e4a26R120-R122 .
With the following approach we could move the shared memory size calculation to target-specific components.

@joviliast (Contributor, Author)

Thanks, @scxiao, for the optimization idea.

@joviliast joviliast marked this pull request as draft December 27, 2024 11:32
@joviliast joviliast force-pushed the atomic-lds-wip branch 16 times, most recently from 90f2d5c to a3de96e on December 30, 2024 18:37
@joviliast joviliast changed the title from [WIP][AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp to [AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp Dec 30, 2024
@joviliast joviliast marked this pull request as ready for review December 30, 2024 18:38
@joviliast joviliast requested a review from ptillet as a code owner December 30, 2024 18:38
```diff
@@ -117,6 +117,9 @@ ScratchConfig getScratchConfigForCvt(RankedTensorType srcTy,
 }
 
 unsigned defaultAllocationAnalysisScratchSizeFn(Operation *op) {
+  if (op->hasAttr("allocation.size")) {
```
Contributor

I think you don't need to touch this function.

You can instead just create another `scratchSizeFn` that wraps `defaultAllocationAnalysisScratchSizeFn`.
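
For illustration, such a wrapper could look roughly like this (a sketch only: the wrapper name is made up, and the forward declaration stands in for wherever `defaultAllocationAnalysisScratchSizeFn` is actually declared):

```cpp
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

// Existing default heuristic (assumed declared in Triton's Allocation analysis).
unsigned defaultAllocationAnalysisScratchSizeFn(mlir::Operation *op);

// Wrapper: honor an explicit allocation.size attribute if present,
// otherwise fall back to the default heuristic.
unsigned amdAllocationAnalysisScratchSizeFn(mlir::Operation *op) {
  if (auto size = op->getAttrOfType<mlir::IntegerAttr>("allocation.size"))
    return size.getValue().getZExtValue();
  return defaultAllocationAnalysisScratchSizeFn(op);
}
```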

Contributor (Author)

Yep, I think it makes sense.

This allocation should be taken into account during every `ModuleAllocation` instantiation though; the most important one is the common `AllocateSharedMemory` pass.
Would it be OK to replace `defaultAllocationAnalysisScratchSizeFn` with the new wrapper as the default parameter of the `ModuleAllocation` constructor?

@Jokeren (Contributor) commented Jan 1, 2025

Update: I removed my previous reply, as I found that the obstacle here is that `Allocation` is constructed twice: once in the general `AllocateSharedMemory` pass, and again in the backend-specific lowering pass. As a result, `scratchSizeFn` has to be set consistently in both places, which is not possible unless `AllocateSharedMemory` is also backend-aware.

Maybe an extra refactor is required to make allocation more flexible.

First, each backend should have its own `AllocateSharedMemory` pass, but they can share common functions.

Second, to avoid constructing `Allocation` twice, we can skip all functions that already have `allocation.offset` set.

This approach seems more explicit than the attribute-based solution described in this PR.
