multiple calls to same dispatch doesn't work #518

nirvedhmeshram · 2024-07-08T21:37:44Z

I was trying a simple example like this through the backend:

!A_TYPE = tensor<256x256xf32>
!B_TYPE = tensor<256x256xf32>
!C_TYPE = tensor<256x256xf32>
func.func @matmul_small_1(%lhs : !A_TYPE,
    %rhs : !B_TYPE) -> !C_TYPE {
  %empty = tensor.empty() : !C_TYPE
  %cst = arith.constant 0.0 : f32
  %fill = linalg.fill ins(%cst : f32) outs(%empty : !C_TYPE) -> !C_TYPE
  %2 = linalg.matmul ins(%lhs, %rhs : !A_TYPE, !B_TYPE)
      outs(%fill : !C_TYPE) -> !C_TYPE
  %3 = linalg.matmul ins(%lhs, %2 : !A_TYPE, !B_TYPE)
      outs(%fill : !C_TYPE) -> !C_TYPE
  %4 = linalg.matmul ins(%lhs, %3 : !A_TYPE, !B_TYPE)
      outs(%fill : !C_TYPE) -> !C_TYPE
  return %4 : !C_TYPE
}

I found that the generated dispatch differs when it is called multiple times. With a single call (which works) we get

      func.func @matmul_small_1_dispatch_0_matmul_256x256x256_f32() {
        %cst = arith.constant 0.000000e+00 : f32
        %c0 = arith.constant 0 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256x256xf32>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256x256xf32>>
        %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<256x256xf32>>
        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [256, 256], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256x256xf32>> -> tensor<256x256xf32>
        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [256, 256], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256x256xf32>> -> tensor<256x256xf32>
        %5 = tensor.empty() : tensor<256x256xf32>
        %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<256x256xf32>) -> tensor<256x256xf32>
        %7 = linalg.matmul ins(%3, %4 : tensor<256x256xf32>, tensor<256x256xf32>) outs(%6 : tensor<256x256xf32>) -> tensor<256x256xf32>
        flow.dispatch.tensor.store %7, %2, offsets = [0, 0], sizes = [256, 256], strides = [1, 1] : tensor<256x256xf32> -> !flow.dispatch.tensor<writeonly:tensor<256x256xf32>>
        return
      }

and when there are multiple calls we get

      func.func @matmul_small_1_dispatch_0_matmul_256x256x256_f32() {
       %c0 = arith.constant 0 : index
       %cst = arith.constant 0.000000e+00 : f32
       %0 = hal.interface.constant.load[0] : i32
       %1 = hal.interface.constant.load[1] : i32
       %2 = arith.index_castui %0 : i32 to index
       %3 = arith.index_castui %1 : i32 to index
       %4 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256x256xf32>>
       %5 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%2) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256x256xf32>>
       %6 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%3) : !flow.dispatch.tensor<writeonly:tensor<256x256xf32>>
       %7 = flow.dispatch.tensor.load %4, offsets = [0, 0], sizes = [256, 256], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256x256xf32>> -> tensor<256x256xf32>
       %8 = flow.dispatch.tensor.load %5, offsets = [0, 0], sizes = [256, 256], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256x256xf32>> -> tensor<256x256xf32>
       %9 = tensor.empty() : tensor<256x256xf32>
       %10 = linalg.fill ins(%cst : f32) outs(%9 : tensor<256x256xf32>) -> tensor<256x256xf32>
       %11 = linalg.matmul ins(%7, %8 : tensor<256x256xf32>, tensor<256x256xf32>) outs(%10 : tensor<256x256xf32>) -> tensor<256x256xf32>
       flow.dispatch.tensor.store %11, %6, offsets = [0, 0], sizes = [256, 256], strides = [1, 1] : tensor<256x256xf32> -> !flow.dispatch.tensor<writeonly:tensor<256x256xf32>>
       return
     }

These subtle differences break the ConvertFuncToLLVMPass used here:

iree-amd-aie/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/Passes.cpp

Line 807 in 90ac965

pm.addPass(createConvertFuncToLLVMPass(opts));

where originally we had

func.func @matmul_small_1_dispatch_0_matmul_256x256x256_f32(%arg0: memref<256x256xf32>, 
%arg1: memref<256x256xf32>, %arg2: memref<256x256xf32>) { ...

Now we get

func.func @matmul_small_1_dispatch_0_matmul_256x256x256_f32(%arg0: memref<256x256xf32>, 
%arg1: memref<256x256xf32, strided<[256, 1], offset: ?>>, %arg2: memref<256x256xf32, strided<[256, 1], offset: ?>>) {

which does not lower to llvm.func and hence fails in a later pass.
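
For readers less familiar with the layout attribute: `memref<256x256xf32, strided<[256, 1], offset: ?>>` describes the same row-major 256x256 view as the plain `memref<256x256xf32>`, except that the position of element (0, 0) inside the underlying buffer is only known at run time. Below is a minimal Python sketch of that addressing rule. It is an illustration only: a flat list stands in for the buffer, the sizes are shrunk to 4x4, offsets are counted in elements, and the helper names `linear_index` and `load_tile` are hypothetical, not part of IREE or MLIR.

```python
# Illustrative model only: a flat Python list stands in for the storage buffer.
ROWS, COLS = 4, 4  # small stand-in for the 256x256 tensors in the issue


def linear_index(i, j, offset=0, strides=(COLS, 1)):
    """Flat position of element (i, j) for a strided layout.

    Mirrors strided<[s0, s1], offset: o>: index = o + i * s0 + j * s1.
    """
    return offset + i * strides[0] + j * strides[1]


def load_tile(buffer, offset):
    """Read a ROWS x COLS row-major view that starts `offset` elements in."""
    return [
        [buffer[linear_index(i, j, offset)] for j in range(COLS)]
        for i in range(ROWS)
    ]


# Two ROWS x COLS tensors packed back to back in one buffer.
buffer = list(range(2 * ROWS * COLS))

# Offset 0 corresponds to the identity layout of the single-call case.
first = load_tile(buffer, 0)

# A non-zero offset, only known at run time, selects the second tensor.
second = load_tile(buffer, ROWS * COLS)
```

With a single call every view starts at offset 0, so the identity layout is enough. With multiple calls to the same dispatch, the offset becomes a run-time value, which is what the `offset: ?` in the function signature records.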

nirvedhmeshram · 2024-07-08T22:40:36Z

Here is a link to the IR dump

nirvedhmeshram · 2024-07-19T15:09:04Z

fixed with #536 and #566

nirvedhmeshram mentioned this issue Jul 11, 2024

delete the microcontroller function after generating the npu instructions #536

Merged

nirvedhmeshram added a commit that referenced this issue Jul 11, 2024

Delete the microcontroller function after generating the npu instruct…

333012d

…ions (#536) This is progress towards #518

nirvedhmeshram closed this as completed Jul 19, 2024
