
[AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp #5503

Open
joviliast wants to merge 1 commit into main from atomic-lds-wip

Conversation

@joviliast (Contributor) commented Dec 26, 2024

Algorithm description (a simplified scalar model follows the list):

  1. Sort {ptr, operand} pairs among the threads within the warp
     via a bitonic sort based on DPP and permute operations;
  2. Distribute threads between groups defined by pointers;
     determine the group of each thread by analyzing its neighbours
     with DPP instructions;
  3. Select a master thread for each group using the exec mask;
  4. Collect a partial sum in LDS for each group via
     DS_ADD-like instructions;
  5. Issue a global atomic operation for each group from
     its master thread.
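
Below is a minimal single-threaded C++ model of these steps, assuming a 64-lane wavefront. It is illustrative only: the real lowering performs the sort with DPP/permute instructions, the neighbour comparison with DPP, master election with the exec mask, and the partial sums with DS_ADD-like LDS instructions, none of which this scalar sketch models.

```cpp
#include <algorithm>
#include <array>
#include <utility>

constexpr int kWarpSize = 64; // assumption: CDNA wavefront size

// Each lane of the wavefront holds one {ptr, operand} pair of an atomic add.
void wavefrontAtomicAdd(std::array<std::pair<float *, float>, kWarpSize> lanes) {
  // Step 1: sort the pairs by pointer (bitonic sort over DPP/permute on HW).
  std::sort(lanes.begin(), lanes.end(),
            [](const auto &a, const auto &b) { return a.first < b.first; });

  int lane = 0;
  while (lane < kWarpSize) {
    // Steps 2-3: a lane whose pointer differs from its left neighbour opens a
    // new group and acts as that group's master thread.
    float *ptr = lanes[lane].first;

    // Step 4: accumulate the group's partial sum (DS_ADD-like LDS ops on HW).
    float partial = 0.0f;
    while (lane < kWarpSize && lanes[lane].first == ptr)
      partial += lanes[lane++].second;

    // Step 5: only the master thread touches global memory, once per group.
    *ptr += partial; // stands in for a single global atomic add
  }
}
```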

Added a lit test checking the highlights of the algorithm.

Since the described algorithm requires additional memory, the size calculation
should be done in target-dependent code.
For this purpose the `SetSpecificAllocationSize` pass was introduced; it sets
the `allocation.size` attribute on the operations that require it. This attribute
takes the highest priority during LDS size calculation in the `Allocation` analysis.
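
A rough sketch of the annotation step is shown below; it is not the actual pass. The names `setSpecificAllocationSizes` and `computeLDSSizeInBytes` are hypothetical, matching ops by the `tt.atomic_rmw` name and storing the size as an `i64` attribute are assumptions here.

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"

// Hypothetical helper: target-dependent LDS requirement of the op, in bytes.
int64_t computeLDSSizeInBytes(mlir::Operation *op);

// Sketch: mark atomic RMW ops with their scratch size so the Allocation
// analysis can pick it up later via the allocation.size attribute.
void setSpecificAllocationSizes(mlir::ModuleOp mod) {
  mod.walk([](mlir::Operation *op) {
    if (op->getName().getStringRef() != "tt.atomic_rmw")
      return;
    mlir::OpBuilder b(op);
    op->setAttr("allocation.size",
                b.getI64IntegerAttr(computeLDSSizeInBytes(op)));
  });
}
```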

Added a lit test for `SetSpecificAllocationSize`.

Also extended `test_core.py::test_tensor_atomic_rmw` to cover the cases
of interest.

@joviliast (Contributor, Author)

TODO: provide testing;

Please consider https://github.com/triton-lang/triton/pull/5503/files#diff-c8636c7e4e8a1713c11e249836ccf2fe132ffc8fd85ad7054582ecad544e4a26R120-R122 .
With the following approach we could move the shared memory size calculation to target-specific components.

@joviliast (Contributor, Author)

Thanks, @scxiao, for the optimization idea.

@joviliast joviliast marked this pull request as draft December 27, 2024 11:32
@joviliast joviliast force-pushed the atomic-lds-wip branch 16 times, most recently from 90f2d5c to a3de96e on December 30, 2024 18:37
@joviliast joviliast changed the title from [WIP][AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp to [AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp Dec 30, 2024
@joviliast joviliast marked this pull request as ready for review December 30, 2024 18:38
@joviliast joviliast requested a review from ptillet as a code owner December 30, 2024 18:38
```diff
@@ -117,6 +117,9 @@ ScratchConfig getScratchConfigForCvt(RankedTensorType srcTy,
 }
 
 unsigned defaultAllocationAnalysisScratchSizeFn(Operation *op) {
+  if (op->hasAttr("allocation.size")) {
```
Contributor

I think you don't need to touch this function.

You can instead just create another `scratchSizeFn` that wraps `defaultAllocationAnalysisScratchSizeFn`.
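
For illustration, such a wrapper could look roughly like this (a sketch only: the wrapper name is made up, and the forward declaration stands in for wherever `defaultAllocationAnalysisScratchSizeFn` is actually declared):

```cpp
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

// Existing default heuristic (assumed declared in Triton's Allocation analysis).
unsigned defaultAllocationAnalysisScratchSizeFn(mlir::Operation *op);

// Wrapper: honor an explicit allocation.size attribute if present,
// otherwise fall back to the default heuristic.
unsigned amdAllocationAnalysisScratchSizeFn(mlir::Operation *op) {
  if (auto size = op->getAttrOfType<mlir::IntegerAttr>("allocation.size"))
    return size.getValue().getZExtValue();
  return defaultAllocationAnalysisScratchSizeFn(op);
}
```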

Contributor (Author)

Yep, I think it makes sense.

This allocation should be taken into account during every `ModuleAllocation` instantiation though; the most important one is the common `AllocateSharedMemory` pass.
Would it be OK to replace `defaultAllocationAnalysisScratchSizeFn` with the new wrapper as the default parameter of the `ModuleAllocation` constructor?

@Jokeren (Contributor) commented Jan 1, 2025

Update: I removed my previous reply, as I found that the obstacle here is that `Allocation` is constructed twice: once in the general `AllocateSharedMemory` pass, and again in the backend-specific lowering pass. As a result, `scratchSizeFn` has to be set consistently in both places, which is not possible unless `AllocateSharedMemory` is also backend-aware.

Maybe an extra refactor is required to make allocation more flexible.

First, each backend should have its own `AllocateSharedMemory` pass, but they can share common functions.

Second, to avoid constructing `Allocation` twice, we can skip all functions that already have `allocation.offset` set.

This approach seems more explicit than the attribute-based solution described in this PR.
