[ConvertToDma] Add options to transpose dma dimensions on target side #812

Merged: 2 commits, Oct 2, 2024

Conversation

@yzhang93 (Contributor) commented Oct 1, 2024

Pack/unpack ops change the data layout, and thus after converting to dma ops, the dma addressing dimensions are expanded/collapsed and transposed. Previously, all the dimension transpositions were on the source side of the dma ops. This PR extends the conversion with an option to have the transposition happen on the target side.

In applications, we can then choose whether the transposition for pack or unpack ops happens on the source or the target side, based on performance, hardware dma requirements, etc. The motivation comes from this discussion, and this PR moves the dma optimization logic into an earlier pass, where the dma ops are converted.

Note that the default options are not changed in this PR (the new defaults will be enabled in a separate PR together with other dma optimization changes), but I have tested all four combinations locally to make sure the dma generation is correct and works e2e. The options can be changed, for example, as follows:

AMDAIEConvertToDmaOptions dmaOptions;
dmaOptions.packTransposeOnSource = false;
dmaOptions.unpackTransposeOnSource = true;
passManager.addPass(createAMDAIEConvertToDmaPass(dmaOptions));
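
The other combinations are obtained by flipping the two booleans. As a minimal sketch, in the same fragment form as the snippet above (assuming, per the description, that the previous all-on-source behavior corresponds to both flags being true, which this PR keeps as the default):

```cpp
AMDAIEConvertToDmaOptions dmaOptions;
// Assumed default, matching the previous behavior: all dimension
// transpositions stay on the source side of the dma ops.
dmaOptions.packTransposeOnSource = true;
dmaOptions.unpackTransposeOnSource = true;
// Flipping either flag to false moves that op's transposition to the
// target side, giving four combinations in total.
passManager.addPass(createAMDAIEConvertToDmaPass(dmaOptions));
```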

@newling (Contributor) commented Oct 2, 2024

I would prefer a separate pass for this. The main reason is to allow for better optimization. The heuristic "do the transpose on the side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity).

What I have in mind is a pass

--rebalance-dma-cpy-nd

which basically converts from one form to the other. I haven't thought about how hard that is.

That said, this change is nice and small so I'm happy to accept this.
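
For illustration, the heuristic being discussed ("do the transpose on the side with fewer dims") could look roughly like the sketch below. The types and names are hypothetical, purely to make the idea concrete; they are not the actual dma op interfaces.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of one side of a dma_cpy_nd access pattern after
// canonicalization (contiguous dimensions already collapsed away).
struct SideAccessPattern {
  std::vector<std::size_t> sizes;  // one entry per remaining addressing dim
};

// "Do the transpose on the side with fewer dims": put the transposition on
// whichever side has fewer remaining addressing dimensions.
static bool shouldTransposeOnSource(const SideAccessPattern &source,
                                    const SideAccessPattern &target) {
  return source.sizes.size() <= target.sizes.size();
}
```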

@@ -387,32 +403,35 @@ void AMDAIEConvertToDmaPass::runOnOperation() {
   // step. This is easy to implement, but not the most direct lowering, so
   // we might want to revisit this.
   WalkResult convertCopiesWalkResult =
-      getOperation()->walk([&rewriter](linalg::CopyOp copyOp) {
+      getOperation()->walk([&](linalg::CopyOp copyOp) {
@newling (Contributor) commented Oct 2, 2024:

Why this change? My preference is
[&rewriter] > [&] > [&, this]

@yzhang93 (Contributor, Author) replied:

I don't think there's a need to pass &rewriter. And I'm trying to follow the coding style in llvm-project, which just passes [&] in such walk functions, e.g. https://github.com/llvm/llvm-project/blob/e1e788f423b5c780c40912ab102b0a3c4b92b9de/mlir/lib/Dialect/SCF/Transforms/ForallToFor.cpp#L63
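
For readers unfamiliar with the distinction, the capture forms being compared are plain C++ lambda captures; a generic illustration, not the code from this pass:

```cpp
#include <iostream>
#include <vector>

struct Rewriter {  // stand-in for the rewriter used in the real pass
  int rewrites = 0;
};

void walkExample(const std::vector<int> &ops, Rewriter &rewriter) {
  // Explicit capture: documents exactly what the lambda touches.
  auto explicitCapture = [&rewriter](int op) { rewriter.rewrites += op; };
  // Capture-everything-by-reference: the terser style common in
  // llvm-project walk callbacks.
  auto captureAll = [&](int op) { rewriter.rewrites += op; };
  for (int op : ops) {
    explicitCapture(op);
    captureAll(op);
  }
  std::cout << "rewrites: " << rewriter.rewrites << "\n";
}
```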

@yzhang93 (Contributor, Author) commented Oct 2, 2024

> I would prefer a separate pass for this. The main reason is to allow for better optimization. The heuristic "do the transpose on the side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity).
>
> What I have in mind is a pass
>
> --rebalance-dma-cpy-nd
>
> which basically converts from one form to the other. I haven't thought about how hard that is.

The goal of this PR is not to balance the dma dimensions. I think there's no reason to keep all transposed dimensions on the source side as the original code does, because that is just one of the four combinations for dma generation (packTransposeOnSource, packTransposeOnTarget, unpackTransposeOnSource, unpackTransposeOnTarget). I don't see any benefit in converting the dma with all transposed dimensions on the source side and then using another pass to make the source side contiguous and move the strided pattern to the target side (as I did previously in #792).

@newling (Contributor) commented Oct 2, 2024

The case I have in mind is a dma copy between two memrefs, one of which is contiguous and one which is not. It might then be better to do the transposes on the contiguous side, because the non-contiguous side will already need DMA dimensions that cannot be collapsed.

I'm confident I can write down examples of packs and unpacks where the transpose must be done on either the source or the target side to ensure you don't run out of dimensions... and probably cases with a mix where one dimension is transposed on the source and one on the target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there.

@jtuyls (Contributor) commented Oct 2, 2024

> I'm confident I can write down examples of packs and unpacks where the transpose must be done on either the source or the target side to ensure you don't run out of dimensions... and probably cases with a mix where one dimension is transposed on the source and one on the target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there.

I am not convinced that we should create a separate pass for rebalancing. Ideally, we do it here in the pack-to-dma conversion, as the logic seems most straightforward at this point. The alternative of a separate pass gets quite complex, as @yzhang93 pointed out: #792. To accommodate different 'strategies' within this pack-to-dma conversion pass, we can work with different options, like what @yzhang93 did, and I think that will get us quite far already.

@jtuyls (Contributor) left a comment:

LGTM. Thanks!

@yzhang93 merged commit 5b816a5 into nod-ai:main on Oct 2, 2024
6 checks passed
newling added a commit that referenced this pull request Oct 2, 2024
Introduced and unaddressed in #812
yzhang93 added a commit that referenced this pull request Oct 10, 2024

1. This PR relaxes the condition for circular dma ops loop subsumption, so that npu.circular_dma_cpy_nd ops can be hoisted out of the loop even if there is another npu.dma_cpy_nd user of the same connection op after it.
2. With this change, we can further subsume loops and hoist npu.dma_cpy_nd ops out of the loop. This PR makes use of #812 and brings the dma optimizations into Passes.cpp.