[AMD][Atomic] Introduce runtime LDS reduction algorithm for atomicRmwOp #5503
base: main
Conversation
TODO: provide testing; please consider https://github.com/triton-lang/triton/pull/5503/files#diff-c8636c7e4e8a1713c11e249836ccf2fe132ffc8fd85ad7054582ecad544e4a26R120-R122 .
Thanks, @scxiao, for the optimization idea.
90f2d5c
to
a3de96e
Compare
a0afcaf
to
77eb985
Compare
@@ -117,6 +117,9 @@ ScratchConfig getScratchConfigForCvt(RankedTensorType srcTy,
}

unsigned defaultAllocationAnalysisScratchSizeFn(Operation *op) {
  if (op->hasAttr("allocation.size")) {
I think you don't need to touch this function. You can instead just create another scratchSizeFn that wraps defaultAllocationAnalysisScratchSizeFn.
Yep, I think it makes sense. This allocation should be taken into account during every ModuleAllocation instantiation though; most important is the common AllocateSharedMemory pass. Would it be OK to change defaultAllocationAnalysisScratchSizeFn to the new wrapper as the default parameter of the ModuleAllocation constructor?
Update: removed my previous reply, as I found the obstacle here is that Allocation is constructed twice: once in the general AllocateSharedMemory pass, and again in the backend-specific lowering pass. As a result, scratchSizeFn has to be set consistently in both places, which is not possible unless AllocateSharedMemory is also backend-aware.
Maybe extra refactoring is required to make allocation more flexible.
First, each backend should have its own AllocateSharedMemory pass, while sharing common functions.
Second, to avoid constructing Allocation twice, we can skip all functions that already have allocation.offset set.
This approach seems more explicit than the attribute-based solution described in this PR.
Algorithm description:
1. Sort {ptr, operand} pairs among the threads within the warp via bitonic sort based on DPP and Permute operations;
2. Distribute threads between groups defined by pointers; determine the group of each thread by analyzing its neighbours with DPP instructions;
3. Select a master thread for each group using the exec mask;
4. Collect the partial sum of each group in LDS via DS_ADD-like instructions;
5. Issue a global atomic operation for each group from its master thread.
Added a lit test checking the algorithm highlights. Since the described algorithm requires additional memory, the size calculation has to be done in target-dependent code. For this purpose the `SetSpecificAllocationSize` pass was introduced; it sets the `allocation.size` attribute on the required operations. This attribute has the highest priority during LDS size calculation in the Allocation analysis. Added a lit test for `SetSpecificAllocationSize`. Also extended `test_core.py::test_tensor_atomic_rmw` to cover the cases of interest.
Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>