
[RFC] Plan to introduce packed_matmul op and packing_map attr for flexible packing matmul representation in upstream mlir #137

@LongshengDu

Motivation

The current linalg dialect only supports fixed input/output shapes for structured named matmul ops; if we want to define different packings for the matmul op in the existing framework, we have to add each one of them separately. Notably, there are many packing methods for matmuls, and many of them are hardware-specific (BFMMLA, VNNI); even some simple packings require more than a few ops to cover. For example, linalg.mmt4d covers (MNmn += MKmk * NKnk), but for (MNmn += MKmk * NKkn) there is no named op. Fundamentally, we also want to avoid adding a separate op for every combination.

This post therefore proposes a new, flexible linalg.packed_matmul op that uses a linalg.packing_map attr to describe different matmul packings, helping developers use named ops to represent the matmuls they need.

Background

For a simple matmul with input/output shapes (M,N) += (M,K) * (K,N) and dims A(0,1), B(0,1), C(0,1):
M is mapped as A(0) -> C(0)
N is mapped as B(1) -> C(1)
K is mapped as A(1) -> B(0)

Three loops are required to carry out the calculation:

for (m : range(M))
  for (n : range(N))
    for (k : range(K))
      C[m, n] += A[m, k] * B[k, n]

For a packed matmul, the three dimension kinds M, N, K are still present in the input/output shapes; once the mapping for each kind of dim is determined, the calculation is fully specified.

For example, a VNNI packed matmul can be represented as C(M,N,M0,N0) += A(M,K,M0,K0*K1) * B(N,K,K0,N0,K1).
So for dims: A(0,1,2,3), B(0,1,2,3,4), C(0,1,2,3)
M is mapped as A(0) -> C(0); A(2) -> C(2)
N is mapped as B(0) -> C(1); B(3) -> C(3)
K is mapped as A(1) -> B(1); A(3) -> B(2, 4)

In this case, seven loops are required to carry out the calculation, which can be represented as:

for (m : range(M))
  for (n : range(N))
    for (m0 : range(M0))
      for (n0 : range(N0))
        for (k : range(K))
          for (k0 : range(K0))
            for (k1 : range(K1))
              C[m, n, m0, n0] += A[m, k, m0, k0 * K1 + k1] * B[n, k, k0, n0, k1]

The linalg.packing_map Attr

For example, in A[a,b] -> B[x,y,z], if dim [a] corresponds to dim [x] and dim [b] corresponds to the packed dims [y,z], we can express this as linalg.packing_map<[a] -> [x]> and linalg.packing_map<[b] -> [y,z]>, where the mapping order of the dims is A -> B.
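
Concretely, the K0/K1 split from the VNNI example above is a single 1->2 mapping from A to B:

// Dim 3 of A (the fused K0*K1 dim) maps onto dims 2 and 4 of B (K0 and K1).
#linalg.packing_map<[3] -> [2, 4]>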

linalg.packing_map is defined as an attr; it takes two integer array parameters representing a mapping between two sets of sorted indices.
The verifier must check that one of the two arrays contains exactly one index, since multi-dims to multi-dims mappings are not allowed.
This defines a 1->N index-set mapping: the src is the single index, the dst is the multi-dim index list.
Helpers are provided to get the mapping order (first->second or first<-second) and the mapping src/dst indices.

def PackingMapAttr : Linalg_Attr<"PackingMap", "packing_map"> {
  let parameters = (ins ArrayRefParameter<"uint64_t">:$first,
                        ArrayRefParameter<"uint64_t">:$second);

  let assemblyFormat = "`<` `[` $first `]` `->` `[` $second `]` `>`";

  let extraClassDeclaration = [{
    /// Index first is 0; Index second is 1
    unsigned getPackingSrcIndex();
    unsigned getPackingDstIndex();
    /// SrcDims.size() == 1; DstDims.size() >= 1
    ArrayRef<uint64_t> getPackingSrcDims();
    ArrayRef<uint64_t> getPackingDstDims();
  }];
}
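
A minimal sketch of how these helpers could be implemented, assuming the TableGen-generated getFirst()/getSecond() accessors for the parameters above (the bodies here are illustrative, not the actual implementation):

/// Illustrative only: the src side is whichever array holds exactly one index
/// (when both do, we arbitrarily treat $first as the src).
unsigned PackingMapAttr::getPackingSrcIndex() {
  return getFirst().size() == 1 ? 0 : 1;
}
unsigned PackingMapAttr::getPackingDstIndex() {
  return 1 - getPackingSrcIndex();
}
ArrayRef<uint64_t> PackingMapAttr::getPackingSrcDims() {
  return getPackingSrcIndex() == 0 ? getFirst() : getSecond();
}
ArrayRef<uint64_t> PackingMapAttr::getPackingDstDims() {
  return getPackingDstIndex() == 0 ? getFirst() : getSecond();
}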

The linalg.packed_matmul Op

linalg.packed_matmul is defined as a structured named op implementing the LinalgOp interface. It takes 2 inputs (A, B) and 1 init (C), and its body computes u5(u1(c) + u2(u3(a) * u4(b))).
It also takes 3 attrs: m_packing, n_packing, k_packing. Each of them is a list of linalg.packing_map attrs covering one kind of matrix dimension, and each attr in the list represents a mapping between 2 sets of indices from matrices A, B, C.
We define the mapping orders as: m_packing A->C, n_packing B->C, k_packing A->B.

For the VNNI packed matmul example above, we can express it as:

#m_packing_vnni = [#linalg.packing_map<[0] -> [0]>, #linalg.packing_map<[2] -> [2]>]
#n_packing_vnni = [#linalg.packing_map<[0] -> [1]>, #linalg.packing_map<[3] -> [3]>]
#k_packing_vnni = [#linalg.packing_map<[1] -> [1]>, #linalg.packing_map<[3] -> [2, 4]>]
func.func @packed_matmul(%A: tensor<2x8x32x32xbf16>, %B: tensor<4x8x16x32x2xbf16>, 
                      %C: tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16> {
  %0 = linalg.packed_matmul 
        {m_packing = #m_packing_vnni, n_packing = #n_packing_vnni, k_packing = #k_packing_vnni}
        ins(%A, %B : tensor<2x8x32x32xbf16>, tensor<4x8x16x32x2xbf16>)
        outs(%C : tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16>
  return %0 : tensor<2x4x32x32xbf16>
}

How to verify

Since the mapping is explicit, these are the criteria to verify this op:

  1. linalg.packed_matmul inputs/outputs must be ranked
  2. the dims appearing across all linalg.packing_map attrs must form a permutation of all dims of A, B, C
  3. all mapped dims must match in size, i.e. in a 1->N index-set mapping, the product of the N dst dim sizes must equal the src dim size (see the sketch after this list)
  4. dynamic dims are ignored when matching sizes (this part needs discussion)
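
As a sketch of criterion 3 (with criterion 4's dynamic-dim escape hatch), the per-map size check could look like the following; verifyPackingMapSizes is a hypothetical helper, not committed code:

// Sketch: check that the mapped sizes agree for one packing_map (criterion 3),
// skipping dynamic dims per criterion 4. `srcShape`/`dstShape` are the shapes
// of the operands the map's src/dst sides refer to.
static LogicalResult verifyPackingMapSizes(PackingMapAttr map,
                                           ArrayRef<int64_t> srcShape,
                                           ArrayRef<int64_t> dstShape) {
  int64_t srcSize = srcShape[map.getPackingSrcDims().front()];
  int64_t dstProduct = 1;
  for (uint64_t dstDim : map.getPackingDstDims()) {
    int64_t size = dstShape[dstDim];
    if (ShapedType::isDynamic(size))
      return success(); // criterion 4: ignore dynamic dims (needs discussion)
    dstProduct *= size;
  }
  if (ShapedType::isDynamic(srcSize))
    return success();
  return success(srcSize == dstProduct);
}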

How to getIteratorTypesArray

Iterations represented by m_packing and n_packing are considered parallel, and iterations represented by k_packing are considered reductions.

Taking the example above:
M is mapped as A(0) -> C(0); A(2) -> C(2)
N is mapped as B(0) -> C(1); B(3) -> C(3)
K is mapped as A(1) -> B(1); A(3) -> B(2, 4)

The loop iterator types are:
d0: C(0): parallel
d1: C(2): parallel
d2: C(1): parallel
d3: C(3): parallel
d4: B(1): reduction
d5: B(2): reduction
d6: B(4): reduction
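
A sketch of how getIteratorTypesArray could derive this ordering; the getMPacking()/getNPacking()/getKPacking() accessors are assumed generated attr getters, not existing API:

// Sketch: each packing_map contributes one loop per dst dim; m/n packing maps
// yield parallel loops, k packing maps yield reduction loops.
SmallVector<utils::IteratorType> PackedMatmulOp::getIteratorTypesArray() {
  auto countLoops = [](ArrayAttr packing) {
    unsigned numLoops = 0;
    for (Attribute map : packing)
      numLoops += cast<PackingMapAttr>(map).getPackingDstDims().size();
    return numLoops;
  };
  SmallVector<utils::IteratorType> iterators;
  iterators.append(countLoops(getMPacking()) + countLoops(getNPacking()),
                   utils::IteratorType::parallel);
  iterators.append(countLoops(getKPacking()), utils::IteratorType::reduction);
  return iterators;
}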

How to getIndexingMaps

Each packing_map determines how loop dims are added to the indexing maps.
For the packing_map dst, the AffineExprs for its indices are the AffineDimExprs representing the loop iterators;
for the packing_map src, the AffineExpr for its single index is a compound expr computed from the dst AffineDimExprs and the dst dim sizes.

// Assign a fresh loop dim to each dst index; the src index then gets the
// row-major linearization of those loop dims (`curr` is the next loop dim number).
compound = 0;
for (dstDim : dstDimsArr) {
  currDimExpr = AffineDimExpr(curr++);
  dstExprs[dstDim] = currDimExpr;
  compound = compound * getDimSize(dstDim) + currDimExpr;
}
srcExprs[srcDim] = compound;

// e.g. for a single linalg.packing_map<[0] -> [0, 1, 2]>, with shapes src[64] -> dst[8, 4, 2],
// the indexing maps are src[(d0 * 4 + d1) * 2 + d2], dst[d0, d1, d2]
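
Putting this together per packing_map, an illustrative helper (processPackingMap is hypothetical) fills in both sides, sharing the loop-dim counter curr across all maps processed in m, n, k order:

// Sketch: fill in the exprs for one packing_map. `srcExprs`/`dstExprs` are the
// per-operand expr vectors for the map's src/dst operands, `dstShape` gives the
// dst operand's dim sizes, and `curr` numbers the loop dims across all maps.
static void processPackingMap(PackingMapAttr map, ArrayRef<int64_t> dstShape,
                              MutableArrayRef<AffineExpr> srcExprs,
                              MutableArrayRef<AffineExpr> dstExprs,
                              unsigned &curr, MLIRContext *ctx) {
  AffineExpr compound = getAffineConstantExpr(0, ctx);
  for (uint64_t dstDim : map.getPackingDstDims()) {
    AffineExpr d = getAffineDimExpr(curr++, ctx);
    dstExprs[dstDim] = d;
    compound = compound * dstShape[dstDim] + d;
  }
  srcExprs[map.getPackingSrcDims().front()] = compound;
}

Wrapping the three per-operand expr lists with AffineMap::get(numLoops, 0, exprs, ctx) then yields exactly the #map/#map1/#map2 shown in the generalized form below.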

The generalized Op

This op offers more levels of representation for matmuls that we want to express as named ops, while still being lowerable to linalg.generic using -linalg-generalize-named-ops.

Taking the VNNI packed matmul example above, the generalized form is:

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d4, d1, d5 * 2 + d6)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d2, d4, d5, d3, d6)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d2, d1, d3)>
module {
  func.func @packed_matmul(%arg0: tensor<2x8x32x32xbf16>, %arg1: tensor<4x8x16x32x2xbf16>, 
                           %arg2: tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16> {
    %0 = linalg.generic {
      indexing_maps = [#map, #map1, #map2], 
      iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
    ins(%arg0, %arg1 : tensor<2x8x32x32xbf16>, tensor<4x8x16x32x2xbf16>) 
    outs(%arg2 : tensor<2x4x32x32xbf16>) { 
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %1 = arith.mulf %in, %in_0 : bf16
      %2 = arith.addf %out, %1 : bf16
      linalg.yield %2 : bf16
    } -> tensor<2x4x32x32xbf16>
    return %0 : tensor<2x4x32x32xbf16>
  }
}

Conclusion

Contributions to this design are welcome; the code has been implemented locally for testing. But before a formal PR, I think a number of decisions should be discussed first: should the mapping order (e.g. A->C, B->C) be explicit on the attr? How can batch matmul and batch-reduce matmul be accommodated in the future? How should dynamic dims be handled? Etc.
