[Performance] Optimizations for matmul #764

yzhang93 · 2024-09-11T20:35:23Z

This issue is used as a tracker for ideas and discussions to improve performance for matmul ops. The data type for all these matmuls is bf16.

Some existing ideas include:

1. Increase L1/L2 tile sizes and tweak the existing double buffering and pipelining if necessary (some experiment results are in the next comment).
2. Optimize the control code to move some DMA instructions from the L3 side to the L2 side.
3. Split buffers and make use of more memtiles and channels.
4. Use 4x4 or 4x8 instead of 2x2. This depends on point 3, as buffers need to be split across different memtiles to make this functional.

@jtuyls Feel free to add more points and details.

yzhang93 · 2024-09-11T22:12:41Z

For the first point, I didn't see a significant performance change after changing from single buffering to double buffering. However, performance improves significantly when the L1/L2 tile sizes are increased (this requires single buffering to avoid exceeding the memory limit).

Here are some comparison results on the matmul shapes from VAE. The execution time is the average of 10 runs.

Current parameter settings:
L2 depth = 2, L1 depth = 2
L2 tile size = 64, L1 tile size = 32

Dispatch Type	Shape	dtype	Compilation Time [ms]	Execution Time [ms] (Phoenix)
matmul	256x65536x512	bf16	18799	1268.7
matmul	128x262144x256	bf16	20865	1668.5
matmul_transpose_b	4096x512x512	bf16	1917	153.7

Now use single buffer:
L2 depth = 1, L1 depth = 1
L2 tile size = 64, L1 tile size = 32

Dispatch Type	Shape	dtype	Compilation Time [ms]	Execution Time [ms] (Phoenix)
matmul	256x65536x512	bf16	17863	1269.5
matmul	128x262144x256	bf16	19951	1669.8
matmul_transpose_b	4096x512x512	bf16	1260	169

Now increase tile sizes:
L2 depth = 1, L1 depth = 1
L2 tile size = 128, L1 tile size = 64

Dispatch Type	Shape	dtype	Compilation Time [ms]	Execution Time [ms] (Phoenix)
matmul	256x65536x512	bf16	4624	765.5
matmul	128x262144x256	bf16	3080	1198.8
matmul_transpose_b	4096x512x512	bf16	975	148
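
To make the comparison easier to read, here is a small sketch that computes the speedup of the larger-tile, single-buffer configuration over the current double-buffer settings. The numbers are copied from the tables above; the dictionary layout and the `speedups` helper are only illustrative.

```python
# (compilation time [ms], execution time [ms]) per shape, copied from the tables above.
baseline = {  # L2 depth = 2, L1 depth = 2, L2 tile size = 64, L1 tile size = 32
    "256x65536x512": (18799, 1268.7),
    "128x262144x256": (20865, 1668.5),
    "4096x512x512": (1917, 153.7),
}
larger_tiles = {  # L2 depth = 1, L1 depth = 1, L2 tile size = 128, L1 tile size = 64
    "256x65536x512": (4624, 765.5),
    "128x262144x256": (3080, 1198.8),
    "4096x512x512": (975, 148),
}


def speedups(before, after):
    """Return {shape: (compile speedup, execution speedup)} as before/after ratios."""
    return {
        shape: (before[shape][0] / after[shape][0], before[shape][1] / after[shape][1])
        for shape in before
    }


results = speedups(baseline, larger_tiles)
for shape, (compile_x, exec_x) in results.items():
    print(f"{shape}: compile {compile_x:.2f}x, execution {exec_x:.2f}x")
```

On these numbers, execution time improves by roughly 1.66x and 1.39x for the two large matmuls, and by about 1.04x for the matmul_transpose_b shape.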

jtuyls · 2024-09-13T16:00:19Z

To add more details on 2), see for example this piece of control code for a 128x128x128 matmul after the DmaComposition pass:

scf.forall (%arg0, %arg1) in (2, 2) {
  %41 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
  %42 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
  ...
  %45 = amdaie.npu.circular_dma_cpy_nd %8([0] [2048] [1], [] [] [])
  %46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0, %41] [4, 2, 32, 32] [4096, 32, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
  amdaie.npu.dma_wait(%46, MM2S)
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

Here %46 has 4 dimensions on the source side and as this is the limit, the loop iteration (see dependency through %41) can't be subsumed into the DMA's source dimensions anymore. However, some of the dimensions on the source side could potentially be moved to the target side (which currently has a linear write access pattern as can be seen in %45). This would typically result in a larger read by the source DMA port and the target DMA port would then take care of writing the result in the expected blocked format, with resulting IR:

scf.forall (%arg0, %arg1) in (2, 2) {
  %41 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
  %42 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
  ...
  %45 = amdaie.npu.circular_dma_cpy_nd %8([0, 0, 0, 0] [4, 32, 2, 32] [2048, 32, 1024, 1], [] [] [])
  %46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %41] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
  amdaie.npu.dma_wait(%46, MM2S)
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

Or after canonicalization:

scf.forall (%arg0, %arg1) in (2, 2) {
  %41 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg1)
  %42 = affine.apply affine_map<(d0) -> (d0 * 64)>(%arg0)
  ...
  %45 = amdaie.npu.circular_dma_cpy_nd %8([0, 0, 0, 0] [4, 32, 2, 32] [2048, 32, 1024, 1], [] [] [])
  %46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %41] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
  amdaie.npu.dma_wait(%46, MM2S)
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
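
As a sanity check on the rewrite above, the following sketch enumerates the addresses touched by each access pattern and confirms that all three variants move the same source elements to the same target locations. It is a simplified model, not the compiler's implementation: it assumes each offset is multiplied by its dimension's stride, that elements stream from source to target in enumeration order, and that the original target side is a plain linear write over the whole transfer. The `addresses` helper is hypothetical.

```python
from itertools import product


def addresses(offsets, sizes, strides):
    """Linear addresses visited by an N-d access pattern, outermost dimension first."""
    base = sum(o * s for o, s in zip(offsets, strides))
    return [
        base + sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in sizes))
    ]


for col in (0, 64):  # the two values %41 takes in the loop
    # Original: 4-d strided read on the source, linear write on the target.
    src_before = addresses([0, 0, 0, col], [4, 2, 32, 32], [4096, 32, 128, 1])
    dst_before = list(range(len(src_before)))
    before = dict(zip(dst_before, src_before))

    # Rewritten: 3-d read on the source, blocked write on the target.
    dst_after = addresses([0, 0, 0, 0], [4, 32, 2, 32], [2048, 32, 1024, 1])
    src_after = addresses([0, 0, col], [4, 32, 64], [4096, 128, 1])
    after = dict(zip(dst_after, src_after))

    # Canonicalized: the two outer source dimensions fold because 32 * 128 == 4096.
    src_canonical = addresses([0, col], [128, 64], [128, 1])
    canonical = dict(zip(dst_after, src_canonical))

    # Each mapping is target address -> source address; all three must agree.
    assert before == after == canonical
    assert len(before) == 4 * 2 * 32 * 32
```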

After this transformation, the source access pattern is left with only 2 dimensions, so now the DmaLoopSubsumption transformation can be applied again to reduce the number of NPU instructions:

%45 = amdaie.npu.circular_dma_cpy_nd %8([0, 0, 0, 0] [4, 32, 2, 32] [2048, 32, 1024, 1], [] [] [])
%46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0] [2, 128, 64] [64, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
scf.forall (%arg0, %arg1) in (2, 2) {
  ...
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
amdaie.npu.dma_wait(%46, MM2S)
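
The subsumption step can be checked the same way. This sketch models only the `%arg1` iteration, since that is the only loop variable the source offset depends on, and again assumes offsets are scaled by their dimension's stride. The `addresses` helper is hypothetical.

```python
from itertools import product


def addresses(offsets, sizes, strides):
    """Linear addresses visited by an N-d access pattern, outermost dimension first."""
    base = sum(o * s for o, s in zip(offsets, strides))
    return [
        base + sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in sizes))
    ]


# Inside the loop: one 2-d source read per iteration, at offset %41 = %arg1 * 64.
looped = []
for arg1 in range(2):
    looped += addresses([0, arg1 * 64], [128, 64], [128, 1])

# Hoisted out of the loop: the iteration becomes a leading dimension (size 2, stride 64).
subsumed = addresses([0, 0, 0], [2, 128, 64], [64, 128, 1])

assert looped == subsumed
```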

newling · 2024-09-19T17:39:18Z

Optimizing point 2) would also reduce compile time. For the larger matmul above, the AMDAIEControlCodeLoopUnroll pass creates O(1e5) operations, and everything thereafter is slow (canonicalization takes O(1) seconds).

yzhang93 · 2024-10-01T05:03:35Z

I have point 2) optimized and working correctly for most shapes. However, the tests with a large K size (>=1024) have numerical issues. Here's a simplified version of the code (with just the L3 to L2 DMA addressing change) that I made for testing purposes: #809.

Note that if I disable the second LoopSubsumptionPass (/DmaComposition), all the tests pass, which means the changes within convert-to-dma work without problems. The problem seems to happen in LoopSubsumptionPass (maybe due to the changes I made to relax the npu.circular_dma constraint?).

Here's the IR dump for 128x128x256 (worked) and 128x128x1024 (failed) for comparison.

@jtuyls do you have any idea about this?

UPDATE: This is currently solved by not subsuming loop iterations for large K size (>=1024) since it would exceed the size limit after inserting new dimensions.

…812) Pack/unpack ops change the data layout, so after conversion to DMA ops the DMA addressing dimensions are expanded/collapsed and transposed. Previously, all dimension transpositions were on the source side of the DMA ops. This PR adds an option to have the transposition happen on the target side instead. In applications, we can choose transposition on the source or the target for pack or unpack ops based on performance, hardware DMA requirements, etc.

The motivation comes from [this discussion](#764 (comment)), and this PR moves the DMA optimization logic to an early pass where the DMA ops are converted.

Note that the default options are not changed in this PR (they will be enabled in a separate PR with other DMA optimization changes), but I have tested all four combinations locally to make sure the generated DMAs are correct and work e2e. The options can be changed, for example, as:

```
AMDAIEConvertToDmaOptions dmaOptions;
dmaOptions.packTransposeOnSource = false;
dmaOptions.unpackTransposeOnSource = true;
passManager.addPass(createAMDAIEConvertToDmaPass(dmaOptions));
```

yzhang93 mentioned this issue Sep 20, 2024

Pass to transfer the strided access pattern from L3 to L2 #792

Closed

yzhang93 mentioned this issue Oct 1, 2024

[ConvertToDma] Add options to tranpose dma dimensions on target side #812

Merged

yzhang93 self-assigned this Oct 4, 2024
