[ConvertToDma] Add options to transpose dma dimensions on target side #812

Merged: 2 commits, Oct 2, 2024

Conversation

@yzhang93 (Contributor) commented Oct 1, 2024

Pack/unpack ops change the data layout, and thus after converting to dma ops, the dma addressing dimensions are expanded/collapsed and transposed. Previously, all the dimension transpositions were on the source side of the dma ops. This PR extends the conversion with an option to have the transposition happen on the target side.

In applications, we can then choose whether the transposition for pack or unpack ops happens on the source or the target side, based on performance, hardware dma requirements, etc. The motivation comes from this discussion, and this PR moves the dma optimization logic into an earlier pass, where the dma ops are converted.

Note that the default options are not changed in this PR (the new defaults will be enabled in a separate PR together with other dma optimization changes), but I have tested all four combinations locally to make sure the dma generation is correct and works e2e. The options can be changed, for example, as follows:

AMDAIEConvertToDmaOptions dmaOptions;
dmaOptions.packTransposeOnSource = false;
dmaOptions.unpackTransposeOnSource = true;
passManager.addPass(createAMDAIEConvertToDmaPass(dmaOptions));
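
The other combinations are obtained by flipping the two booleans. As a minimal sketch, in the same fragment form as the snippet above (assuming, per the description, that the previous all-on-source behavior corresponds to both flags being true, which this PR keeps as the default):

```cpp
AMDAIEConvertToDmaOptions dmaOptions;
// Assumed default, matching the previous behavior: all dimension
// transpositions stay on the source side of the dma ops.
dmaOptions.packTransposeOnSource = true;
dmaOptions.unpackTransposeOnSource = true;
// Flipping either flag to false moves that op's transposition to the
// target side, giving four combinations in total.
passManager.addPass(createAMDAIEConvertToDmaPass(dmaOptions));
```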

@newling (Contributor) commented Oct 2, 2024

I would prefer a separate pass for this. The main reason is to allow for better optimization. The heuristic "do the transpose on the side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity).

What I have in mind is a pass

--rebalance-dma-cpy-nd

which basically converts from one form to the other. I haven't thought about how hard that is.

That said, this change is nice and small so I'm happy to accept this.
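
For illustration, the heuristic being discussed ("do the transpose on the side with fewer dims") could look roughly like the sketch below. The types and names are hypothetical, purely to make the idea concrete; they are not the actual dma op interfaces.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of one side of a dma_cpy_nd access pattern after
// canonicalization (contiguous dimensions already collapsed away).
struct SideAccessPattern {
  std::vector<std::size_t> sizes;  // one entry per remaining addressing dim
};

// "Do the transpose on the side with fewer dims": put the transposition on
// whichever side has fewer remaining addressing dimensions.
static bool shouldTransposeOnSource(const SideAccessPattern &source,
                                    const SideAccessPattern &target) {
  return source.sizes.size() <= target.sizes.size();
}
```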

@@ -387,32 +403,35 @@ void AMDAIEConvertToDmaPass::runOnOperation() {
   // step. This is easy to implement, but not the most direct lowering, so
   // we might want to revisit this.
   WalkResult convertCopiesWalkResult =
-      getOperation()->walk([&rewriter](linalg::CopyOp copyOp) {
+      getOperation()->walk([&](linalg::CopyOp copyOp) {
@newling (Contributor) commented Oct 2, 2024:

Why this change? My preference is
[&rewriter] > [&] > [&, this]

@yzhang93 (Contributor, Author) replied:

I don't think there's a need to pass &rewriter. And I'm trying to follow the coding style in llvm-project, which just passes [&] in such walk functions, e.g. https://github.com/llvm/llvm-project/blob/e1e788f423b5c780c40912ab102b0a3c4b92b9de/mlir/lib/Dialect/SCF/Transforms/ForallToFor.cpp#L63
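
For readers unfamiliar with the distinction, the capture forms being compared are plain C++ lambda captures; a generic illustration, not the code from this pass:

```cpp
#include <iostream>
#include <vector>

struct Rewriter {  // stand-in for the rewriter used in the real pass
  int rewrites = 0;
};

void walkExample(const std::vector<int> &ops, Rewriter &rewriter) {
  // Explicit capture: documents exactly what the lambda touches.
  auto explicitCapture = [&rewriter](int op) { rewriter.rewrites += op; };
  // Capture-everything-by-reference: the terser style common in
  // llvm-project walk callbacks.
  auto captureAll = [&](int op) { rewriter.rewrites += op; };
  for (int op : ops) {
    explicitCapture(op);
    captureAll(op);
  }
  std::cout << "rewrites: " << rewriter.rewrites << "\n";
}
```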

@yzhang93 (Contributor, Author) commented Oct 2, 2024

> I would prefer a separate pass for this. The main reason is to allow for better optimization. The heuristic "do the transpose on the side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity).
>
> What I have in mind is a pass
>
> --rebalance-dma-cpy-nd
>
> which basically converts from one form to the other. I haven't thought about how hard that is.

The goal of this PR is not to balance the dma dimensions. I think there's no reason to keep all transposed dimensions on the source side as the original code does, because that is just one of the four combinations for dma generation (packTransposeOnSource, packTransposeOnTarget, unpackTransposeOnSource, unpackTransposeOnTarget). I don't see any benefit in converting the dma with all transposed dimensions on the source side and then using another pass to make the source side contiguous and move the strided pattern to the target side (as I did previously in #792).

@newling (Contributor) commented Oct 2, 2024

The case I have in mind is a dma copy between two memrefs, one of which is contiguous and one which is not. It might then be better to do the transposes on the contiguous side, because the non-contiguous side will already need DMA dimensions that cannot be collapsed.

I'm confident I can write down examples of packs and unpacks where the transpose must be done on either the source or the target side to ensure you don't run out of dimensions... and probably cases with a mix where one dimension is transposed on the source and one on the target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there.

@jtuyls (Contributor) commented Oct 2, 2024

> I'm confident I can write down examples of packs and unpacks where the transpose must be done on either the source or the target side to ensure you don't run out of dimensions... and probably cases with a mix where one dimension is transposed on the source and one on the target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there.

I am not convinced that we should create a separate pass for rebalancing. Ideally, we do it here in the pack-to-dma conversion, as the logic seems most straightforward at this point. The alternative of a separate pass gets quite complex, as @yzhang93 pointed out: #792. To accommodate different 'strategies' within this pack-to-dma conversion pass, we can work with different options, like what @yzhang93 did, and I think that will get us quite far already.

@jtuyls (Contributor) left a comment:

LGTM. Thanks!

@yzhang93 merged commit 5b816a5 into nod-ai:main on Oct 2, 2024
6 checks passed
newling added a commit that referenced this pull request Oct 2, 2024
Introduced and unaddressed in #812
yzhang93 added a commit that referenced this pull request Oct 10, 2024

1. This PR relaxes the condition for circular dma ops loop subsumption, so that npu.circular_dma_cpy_nd ops can be hoisted out of the loop even if there is another npu.dma_cpy_nd user of the same connection op after it.
2. With this change, we can further subsume loops and hoist npu.dma_cpy_nd ops out of the loop. This PR makes use of #812 and brings the dma optimizations into Passes.cpp.