
[RFC] Plan to introduce packed_matmul op and packing_map attr for flexible packing matmul representation in upstream mlir #137

@LongshengDu

Motivation

The current linalg dialect only supports fixed input/output shapes for structured named matmul ops; if we want to define different packings for the matmul op in the existing framework, we have to add each one of them separately. Notably, there are many packing methods for matmuls, and many of them are hardware-specific (BFMMLA, VNNI); even some simple packings require more than a few ops to cover. For example, linalg.mmt4d covers (MNmn += MKmk * NKnk), but for (MNmn += MKmk * NKkn) there is no named op. Fundamentally, we also want to avoid adding a separate op for every combination.

This post therefore proposes a new, flexible linalg.packed_matmul op that uses a linalg.packing_map attr to describe different matmul packings, helping developers use named ops to represent the matmuls they need.

Background

For a simple matmul with input/output shapes (M,N) += (M,K) * (K,N) and dims A(0,1), B(0,1), C(0,1):
M is mapped as A(0) -> C(0)
N is mapped as B(1) -> C(1)
K is mapped as A(1) -> B(0)

Three loops are required to carry out the calculation:

for (m : range(M))
  for (n : range(N))
    for (k : range(K))
      C[m, n] += A[m, k] * B[k, n]

For a packed matmul, the three dimension kinds M, N, K are still present in the input/output shapes; once the mapping for each kind of dim is determined, the calculation is fully specified.

For example, a VNNI packed matmul can be represented as C(M,N,M0,N0) += A(M,K,M0,K0*K1) * B(N,K,K0,N0,K1).
So for dims: A(0,1,2,3), B(0,1,2,3,4), C(0,1,2,3)
M is mapped as A(0) -> C(0); A(2) -> C(2)
N is mapped as B(0) -> C(1); B(3) -> C(3)
K is mapped as A(1) -> B(1); A(3) -> B(2, 4)

In this case, seven loops are required to carry out the calculation, which can be represented as:

for (m : range(M))
  for (n : range(N))
    for (m0 : range(M0))
      for (n0 : range(N0))
        for (k : range(K))
          for (k0 : range(K0))
            for (k1 : range(K1))
              C[m, n, m0, n0] += A[m, k, m0, k0 * K1 + k1] * B[n, k, k0, n0, k1]

The linalg.packing_map Attr

For example, in A[a,b] -> B[x,y,z], if dim [a] corresponds to dim [x] and dim [b] corresponds to the packed dims [y,z], we can express this as linalg.packing_map<[a] -> [x]> and linalg.packing_map<[b] -> [y,z]>, where the mapping order of the dims is A -> B.
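
Concretely, the K0/K1 split from the VNNI example above is a single 1->2 mapping from A to B:

// Dim 3 of A (the fused K0*K1 dim) maps onto dims 2 and 4 of B (K0 and K1).
#linalg.packing_map<[3] -> [2, 4]>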

linalg.packing_map is defined as an attr; it takes two integer array parameters representing a mapping between two sets of sorted indices.
The verifier must check that one of the two arrays contains exactly one index, since multi-dims to multi-dims mappings are not allowed.
This defines a 1->N index-set mapping: the src is the single index, the dst is the multi-dim index list.
Helpers are provided to get the mapping order (first->second or first<-second) and the mapping src/dst indices.

def PackingMapAttr : Linalg_Attr<"PackingMap", "packing_map"> {
  let parameters = (ins ArrayRefParameter<"uint64_t">:$first,
                        ArrayRefParameter<"uint64_t">:$second);

  let assemblyFormat = "`<` `[` $first `]` `->` `[` $second `]` `>`";

  let extraClassDeclaration = [{
    /// Index first is 0; Index second is 1
    unsigned getPackingSrcIndex();
    unsigned getPackingDstIndex();
    /// SrcDims.size() == 1; DstDims.size() >= 1
    ArrayRef<uint64_t> getPackingSrcDims();
    ArrayRef<uint64_t> getPackingDstDims();
  }];
}
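
A minimal sketch of how these helpers could be implemented, assuming the TableGen-generated getFirst()/getSecond() accessors for the parameters above (the bodies here are illustrative, not the actual implementation):

/// Illustrative only: the src side is whichever array holds exactly one index
/// (when both do, we arbitrarily treat $first as the src).
unsigned PackingMapAttr::getPackingSrcIndex() {
  return getFirst().size() == 1 ? 0 : 1;
}
unsigned PackingMapAttr::getPackingDstIndex() {
  return 1 - getPackingSrcIndex();
}
ArrayRef<uint64_t> PackingMapAttr::getPackingSrcDims() {
  return getPackingSrcIndex() == 0 ? getFirst() : getSecond();
}
ArrayRef<uint64_t> PackingMapAttr::getPackingDstDims() {
  return getPackingDstIndex() == 0 ? getFirst() : getSecond();
}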

The linalg.packed_matmul Op

linalg.packed_matmul is defined as a structured named op implementing the LinalgOp interface. It takes 2 inputs (A, B) and 1 init (C), and its body computes u5(u1(c) + u2(u3(a) * u4(b))).
It also takes 3 attrs: m_packing, n_packing, k_packing. Each of them is a list of linalg.packing_map attrs covering one kind of matrix dimension, and each attr in the list represents a mapping between 2 sets of indices from matrices A, B, C.
We define the mapping orders as: m_packing A->C, n_packing B->C, k_packing A->B.

For the VNNI packed matmul example above, we can express it as:

#m_packing_vnni = [#linalg.packing_map<[0] -> [0]>, #linalg.packing_map<[2] -> [2]>]
#n_packing_vnni = [#linalg.packing_map<[0] -> [1]>, #linalg.packing_map<[3] -> [3]>]
#k_packing_vnni = [#linalg.packing_map<[1] -> [1]>, #linalg.packing_map<[3] -> [2, 4]>]
func.func @packed_matmul(%A: tensor<2x8x32x32xbf16>, %B: tensor<4x8x16x32x2xbf16>, 
                      %C: tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16> {
  %0 = linalg.packed_matmul 
        {m_packing = #m_packing_vnni, n_packing = #n_packing_vnni, k_packing = #k_packing_vnni}
        ins(%A, %B : tensor<2x8x32x32xbf16>, tensor<4x8x16x32x2xbf16>)
        outs(%C : tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16>
  return %0 : tensor<2x4x32x32xbf16>
}

How to verify

Since the mapping is explicit, these are the criteria to verify this op:

  1. linalg.packed_matmul inputs/outputs must be ranked
  2. the dims appearing across all linalg.packing_map attrs must form a permutation of all dims of A, B, C
  3. all mapped dims must match in size, i.e. in a 1->N index-set mapping, the product of the N dst dim sizes must equal the src dim size (see the sketch after this list)
  4. dynamic dims are ignored when matching sizes (this part needs discussion)
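
As a sketch of criterion 3 (with criterion 4's dynamic-dim escape hatch), the per-map size check could look like the following; verifyPackingMapSizes is a hypothetical helper, not committed code:

// Sketch: check that the mapped sizes agree for one packing_map (criterion 3),
// skipping dynamic dims per criterion 4. `srcShape`/`dstShape` are the shapes
// of the operands the map's src/dst sides refer to.
static LogicalResult verifyPackingMapSizes(PackingMapAttr map,
                                           ArrayRef<int64_t> srcShape,
                                           ArrayRef<int64_t> dstShape) {
  int64_t srcSize = srcShape[map.getPackingSrcDims().front()];
  int64_t dstProduct = 1;
  for (uint64_t dstDim : map.getPackingDstDims()) {
    int64_t size = dstShape[dstDim];
    if (ShapedType::isDynamic(size))
      return success(); // criterion 4: ignore dynamic dims (needs discussion)
    dstProduct *= size;
  }
  if (ShapedType::isDynamic(srcSize))
    return success();
  return success(srcSize == dstProduct);
}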

How to getIteratorTypesArray

Iterations represented by m_packing and n_packing are considered parallel, and iterations represented by k_packing are considered reductions.

Taking the example above:
M is mapped as A(0) -> C(0); A(2) -> C(2)
N is mapped as B(0) -> C(1); B(3) -> C(3)
K is mapped as A(1) -> B(1); A(3) -> B(2, 4)

The loop iterator types are:
d0: C(0): parallel
d1: C(2): parallel
d2: C(1): parallel
d3: C(3): parallel
d4: B(1): reduction
d5: B(2): reduction
d6: B(4): reduction
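
A sketch of how getIteratorTypesArray could derive this ordering; the getMPacking()/getNPacking()/getKPacking() accessors are assumed generated attr getters, not existing API:

// Sketch: each packing_map contributes one loop per dst dim; m/n packing maps
// yield parallel loops, k packing maps yield reduction loops.
SmallVector<utils::IteratorType> PackedMatmulOp::getIteratorTypesArray() {
  auto countLoops = [](ArrayAttr packing) {
    unsigned numLoops = 0;
    for (Attribute map : packing)
      numLoops += cast<PackingMapAttr>(map).getPackingDstDims().size();
    return numLoops;
  };
  SmallVector<utils::IteratorType> iterators;
  iterators.append(countLoops(getMPacking()) + countLoops(getNPacking()),
                   utils::IteratorType::parallel);
  iterators.append(countLoops(getKPacking()), utils::IteratorType::reduction);
  return iterators;
}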

How to getIndexingMaps

Each packing_map determines how loop dims are added to the indexing maps.
For the packing_map dst, the AffineExprs for its indices are the AffineDimExprs representing the loop iterators;
for the packing_map src, the AffineExpr for its single index is a compound expr computed from the dst AffineDimExprs and the dst dim sizes.

// Assign a fresh loop dim to each dst index; the src index then gets the
// row-major linearization of those loop dims (`curr` is the next loop dim number).
compound = 0;
for (dstDim : dstDimsArr) {
  currDimExpr = AffineDimExpr(curr++);
  dstExprs[dstDim] = currDimExpr;
  compound = compound * getDimSize(dstDim) + currDimExpr;
}
srcExprs[srcDim] = compound;

// e.g. for a single linalg.packing_map<[0] -> [0, 1, 2]>, with shapes src[64] -> dst[8, 4, 2],
// the indexing maps are src[(d0 * 4 + d1) * 2 + d2], dst[d0, d1, d2]
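
Putting this together per packing_map, an illustrative helper (processPackingMap is hypothetical) fills in both sides, sharing the loop-dim counter curr across all maps processed in m, n, k order:

// Sketch: fill in the exprs for one packing_map. `srcExprs`/`dstExprs` are the
// per-operand expr vectors for the map's src/dst operands, `dstShape` gives the
// dst operand's dim sizes, and `curr` numbers the loop dims across all maps.
static void processPackingMap(PackingMapAttr map, ArrayRef<int64_t> dstShape,
                              MutableArrayRef<AffineExpr> srcExprs,
                              MutableArrayRef<AffineExpr> dstExprs,
                              unsigned &curr, MLIRContext *ctx) {
  AffineExpr compound = getAffineConstantExpr(0, ctx);
  for (uint64_t dstDim : map.getPackingDstDims()) {
    AffineExpr d = getAffineDimExpr(curr++, ctx);
    dstExprs[dstDim] = d;
    compound = compound * dstShape[dstDim] + d;
  }
  srcExprs[map.getPackingSrcDims().front()] = compound;
}

Wrapping the three per-operand expr lists with AffineMap::get(numLoops, 0, exprs, ctx) then yields exactly the #map/#map1/#map2 shown in the generalized form below.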

The generalized Op

This op offers more levels of representation for matmuls that we want to express as named ops, while still being lowerable to linalg.generic using -linalg-generalize-named-ops.

Taking the VNNI packed matmul example above, the generalized form is:

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d4, d1, d5 * 2 + d6)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d2, d4, d5, d3, d6)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d2, d1, d3)>
module {
  func.func @packed_matmul(%arg0: tensor<2x8x32x32xbf16>, %arg1: tensor<4x8x16x32x2xbf16>, 
                           %arg2: tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16> {
    %0 = linalg.generic {
      indexing_maps = [#map, #map1, #map2], 
      iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
    ins(%arg0, %arg1 : tensor<2x8x32x32xbf16>, tensor<4x8x16x32x2xbf16>) 
    outs(%arg2 : tensor<2x4x32x32xbf16>) { 
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %1 = arith.mulf %in, %in_0 : bf16
      %2 = arith.addf %out, %1 : bf16
      linalg.yield %2 : bf16
    } -> tensor<2x4x32x32xbf16>
    return %0 : tensor<2x4x32x32xbf16>
  }
}

Conclusion

Contributions to this design are welcome; the code has been implemented locally for testing. But before a formal PR, I think a number of decisions should be discussed first: should the mapping order (e.g. A->C, B->C) be explicit on the attr? How can batch matmul and batch-reduce matmul be accommodated in the future? How should dynamic dims be handled? Etc.
