[mlir][xegpu] Add SIMT distribution patterns for UpdateNdOffset, PrefetchNd and GPU Index Ops. #136743

Draft · wants to merge 78 commits into base: main

Changes from 1 commit

Commits (78)
39dcf9d
save work
charithaintc Mar 18, 2025
2058773
moving all ops to region working
charithaintc Mar 20, 2025
14233fa
moving all ops to region working
charithaintc Mar 20, 2025
f599873
save work
charithaintc Mar 20, 2025
220ed1f
save work
charithaintc Mar 21, 2025
2a8070f
save work
charithaintc Mar 21, 2025
4838b52
extend sg_map from subgroup to workgroup
chencha3 Mar 21, 2025
cb26979
format code
chencha3 Mar 21, 2025
273fc40
remove changes to prefetch op
chencha3 Mar 21, 2025
504d274
refine the doc for TensorDesc
chencha3 Mar 21, 2025
90e0704
save work
charithaintc Mar 21, 2025
3abe7cb
save work
charithaintc Mar 21, 2025
7c87319
Merge branch 'main' into xegpu_simt_dist
charithaintc Mar 21, 2025
596c953
update doc
chencha3 Mar 21, 2025
2065764
save work
charithaintc Mar 21, 2025
899439b
refine docs
chencha3 Mar 24, 2025
8636d15
refine docs
chencha3 Mar 24, 2025
0190418
refine util
chencha3 Mar 24, 2025
32f9272
refine convert_layout docs
chencha3 Mar 24, 2025
fe11c79
save work
charithaintc Mar 24, 2025
6e1ef3e
save work
charithaintc Mar 24, 2025
55c272c
save work
charithaintc Mar 25, 2025
ee56a3e
Merge branch 'gpu_dialect_changes' into xegpu_simt_dist
charithaintc Mar 25, 2025
1ffe5c8
save work
charithaintc Mar 26, 2025
e5521f9
save work before merging with Chao's PR
charithaintc Mar 27, 2025
350b581
Merge branch 'users/chencha3/xegpu/extend_sg_map' into xegpu_simt_dist
charithaintc Mar 27, 2025
5700c81
merge xegpu changes
charithaintc Mar 29, 2025
1619fcf
Merge branch 'main' into xegpu_simt_dist
charithaintc Mar 31, 2025
2334a97
refactor names
charithaintc Mar 31, 2025
9bddeb6
drop ScopeAttr and refine 1D layout support
chencha3 Apr 1, 2025
784ab38
refine isEvenDistributed
chencha3 Apr 1, 2025
28cf69e
format code
chencha3 Apr 1, 2025
930f1ab
Merge branch 'main' into extend_sg_map
chencha3 Apr 1, 2025
9ed0f87
fix format issue
chencha3 Apr 1, 2025
3b389bf
add 1D layout examples
chencha3 Apr 1, 2025
589d217
refactor names
charithaintc Apr 2, 2025
8b647c4
Merge branch 'users/chencha3/xegpu/extend_sg_map' into xegpu_simt_dist
charithaintc Apr 2, 2025
c6ccef2
refactor
charithaintc Apr 2, 2025
cbd0af0
refine LayoutAttr verifier
chencha3 Apr 4, 2025
3fb4fd4
add unit test
chencha3 Apr 4, 2025
77fdfef
remove dump file
chencha3 Apr 4, 2025
2751332
fix typo
chencha3 Apr 4, 2025
2a16d11
Merge branch 'main' into extend_sg_map
chencha3 Apr 4, 2025
d281a14
fix an error after merging with main
chencha3 Apr 4, 2025
fb28ce8
new line at the end of file
chencha3 Apr 7, 2025
f464662
update doc
chencha3 Apr 8, 2025
eea3c35
Merge branch 'main' into extend_sg_map
chencha3 Apr 8, 2025
7acc56d
Merge branch 'users/chencha3/xegpu/extend_sg_map' into xegpu_simt_dist
charithaintc Apr 8, 2025
270b498
Merge branch 'main' into xegpu_simt_dist
charithaintc Apr 9, 2025
2a1d373
Switch to 1D representation for SIMT
chencha3 Apr 10, 2025
2159119
refine verifier for load_nd and store_nd
chencha3 Apr 10, 2025
21f50c0
fix issues
charithaintc Apr 10, 2025
35f9cbe
Merge branch 'main' into xegpu_simt_dist
charithaintc Apr 10, 2025
c81b2e0
fix issues
charithaintc Apr 10, 2025
03bfe08
Merge branch 'users/chencha3/xegpu/xegpu_simt_2d_to_1d' into xegpu_si…
charithaintc Apr 11, 2025
2f2ec10
fix issues
charithaintc Apr 14, 2025
2ae3543
fix issues
charithaintc Apr 14, 2025
4c63916
fix issues
charithaintc Apr 14, 2025
2d9cfa3
fix build issue
charithaintc Apr 15, 2025
775d039
refine verifier for gather/scatter
chencha3 Apr 15, 2025
5520ce1
update comments
chencha3 Apr 15, 2025
6abc12a
fix tests
charithaintc Apr 15, 2025
379e186
fix
charithaintc Apr 16, 2025
aa7dbe1
fix
charithaintc Apr 16, 2025
dce6d2a
Merge branch 'users/chencha3/xegpu/xegpu_simt_2d_to_1d' into xegpu_si…
charithaintc Apr 16, 2025
ca5c7e9
fix comments
charithaintc Apr 16, 2025
ed3119c
fix comments
charithaintc Apr 16, 2025
c898de6
fix comments
charithaintc Apr 17, 2025
55be710
fix comments
charithaintc Apr 17, 2025
6e8888a
fix
charithaintc Apr 18, 2025
6ae7aa0
fix
charithaintc Apr 18, 2025
2896b34
Merge branch 'main' into xegpu_simt_dist
charithaintc Apr 18, 2025
68b1750
fix
charithaintc Apr 18, 2025
5f1798d
save work
charithaintc Apr 21, 2025
9391696
Merge branch 'main' into xegpu_simt_dist
charithaintc Apr 22, 2025
b3e6dc5
save work
charithaintc Apr 22, 2025
08d9e7b
Merge branch 'xegpu_simt_dist' into distribute_scf
charithaintc Apr 22, 2025
6447c63
add prefetch support
charithaintc Apr 22, 2025
Switch to 1D representation for SIMT
chencha3 committed Apr 10, 2025
commit 2a1d373a61ca10bca9064a2afa7ac1fb88a87fc8
17 changes: 6 additions & 11 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -833,30 +833,25 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
     data type, the matrices are `A: vector<8x16xf16>`, `B: vector<16x16xf16>`,
     and `C/D: vector<8x16xf32>`. Besides the matrix size requirements, DPAS
     also requires A and B to be loaded with the required data layout. Specially,
-
     VNNI layout is required for B operand. It is achieved via adding `packed`
     attribute to the `load_nd` operator. Due to the VNNI transformation, B operands
     can be represented as a 3D vector, with the last dimension representing the VNNI
     factor, which is computed as `32/bit_width_of_elem_type`. Thus, `B: vector<16x16xf16>`
     can be represented as `B: vector<8x16x2xf16>`.

-    In SIMT mode, DpasOp expects layout attributes `a`, `b`, and `c` (only if acc is used)
-    which describe the data fragment owned by each work-item w.r.t. the tensor descriptor
-    these data are loaded from.
+    In SIMT code, each work-item from a subgroup holds a data fragment for A, B, C and the result,
+    which are represented as 1D vectors.

     Note: on PVC, the hardware can perform load with VNNI transformation when data
     element type is 16-bit or lower precision, taking 2 or 4 elements from
     the first dimension and inserted into the newly added innermost dimension.
   }];

   let arguments = (ins
-    XeGPU_DpasOpType : $lhs,
-    XeGPU_DpasOpType : $rhs,
-    Optional<XeGPU_Vector2DType>: $acc,
-    OptionalAttr<XeGPU_LayoutAttr>:$a_layout,
-    OptionalAttr<XeGPU_LayoutAttr>:$b_layout,
-    OptionalAttr<XeGPU_LayoutAttr>:$c_layout);
-  let results = (outs XeGPU_Vector2DType: $result);
+    XeGPU_DpasOprType : $lhs,
+    XeGPU_DpasOprType : $rhs,
+    Optional<XeGPU_DpasResType>: $acc);
+  let results = (outs XeGPU_DpasResType: $result);

   let extraClassDeclaration = [{
     VectorType getLhsType() {
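Editor's note: the VNNI shape arithmetic described in the DpasOp documentation above is mechanical and can be checked by hand. The C++ sketch below reproduces it; vnniPackedShape is a hypothetical helper written for illustration only, not part of the XeGPU dialect:

#include <cassert>
#include <cstdint>
#include <vector>

// VNNI packing of a [K, N] B operand, as described in the DpasOp docs:
// the VNNI factor is 32 / bit_width_of_elem_type, and [K, N] becomes
// [K / factor, N, factor].
std::vector<int64_t> vnniPackedShape(int64_t k, int64_t n, int64_t elemBits) {
  assert(elemBits <= 32 && 32 % elemBits == 0 &&
         "VNNI applies to element widths that divide 32 bits");
  int64_t factor = 32 / elemBits; // 2 for f16/bf16, 4 for 8-bit types
  assert(k % factor == 0 && "K must be divisible by the VNNI factor");
  return {k / factor, n, factor};
}

// Example from the docs: B: vector<16x16xf16> has 16-bit elements, so
// vnniPackedShape(16, 16, 16) returns {8, 16, 2}, i.e. vector<8x16x2xf16>.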
3 changes: 2 additions & 1 deletion mlir/include/mlir/Dialect/XeGPU/IR/XeGPUTypes.td
@@ -17,7 +17,8 @@ def XeGPU_IntType: AnyTypeOf<[I1, I8, I16, I32, I64, SI1, SI8, SI16, SI32, SI64,
 def XeGPU_FloatType: AnyTypeOf<[F16, F32, F64, BF16, TF32]>;
 def XeGPU_ScalarType: AnyTypeOf<[XeGPU_IntType, XeGPU_FloatType]>;
 def XeGPU_BaseAddrType: AnyTypeOf<[Non0RankedMemRefOf<[XeGPU_ScalarType]>, UI64, UI32, I64, I32]>;
-def XeGPU_DpasOpType: VectorOfRankAndType<[2, 3], [XeGPU_ScalarType]>;
+def XeGPU_DpasOprType: VectorOfRankAndType<[1, 2, 3], [XeGPU_ScalarType]>;
+def XeGPU_DpasResType: VectorOfRankAndType<[1, 2], [XeGPU_ScalarType]>;
 def XeGPU_OffsetType: VectorOfRankAndType<[1], [Index]>;
 def XeGPU_MaskType: AnyTypeOf<[VectorOfRankAndType<[1], [I1]>, I1]>;
 def XeGPU_ValueType: AnyTypeOf<[VectorOfRankAndType<[1,2,3,4], [XeGPU_ScalarType]>, XeGPU_ScalarType]>;
26 changes: 12 additions & 14 deletions mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
@@ -10,6 +10,7 @@
 #include "mlir/IR/Builders.h"
 #include "mlir/IR/DialectImplementation.h"
 #include "llvm/ADT/TypeSwitch.h"
+#include <numeric>

 namespace mlir {
 namespace xegpu {
@@ -336,32 +337,30 @@ LogicalResult TensorDescType::verify(
 // [n_distribution_units, lane_data_size]
 FailureOr<VectorType> TensorDescType::getDistributedVectorType() {
   auto layout = llvm::dyn_cast_if_present<LayoutAttr>(getLayout());
-  // If no layout is provided, tensor desc is not used in SIMT mode.
-  if (!layout)
+  // It only works for subgroup level layout, which only has lane_layout
+  // and lane_data, and is to distribute a SIMD code into SIMT code.
+  if (!layout || !layout.isSgLayout())
     return failure();

   SmallVector<int64_t> laneData(layout.getLaneData().asArrayRef());
   SmallVector<int64_t> laneLayout(layout.getLaneLayout().asArrayRef());
   auto tdescShape = getShape();

-  auto laneDataSize = 1, sgSize = 1;
-  for (auto [laneDim, laneDataDim] : llvm::zip_equal(laneLayout, laneData)) {
-    laneDataSize *= laneDataDim;
-    sgSize *= laneDim;
-  }
+  // compute sgSize by multiply elements of laneLayout
+  // e.g. for 2D layout, sgSize = laneLayout[0] * laneLayout[1]
+  // e.g. for 1D layout, sgSize = laneLayout[0]
+  auto sgSize = std::accumulate(laneLayout.begin(), laneLayout.end(), 1,
+                                std::multiplies<int64_t>());

   // Case 1: regular loads/stores
   auto scatterAttr = getEncodingAsScatterTensorDescAttr();
   if (scatterAttr) {
     auto chunkSize = scatterAttr.getChunkSize().getInt();
     // Verify if the first dimension of the tensor descriptor shape is
     // distributable.
-    assert(tdescShape[0] % (laneLayout[0]) == 0 &&
+    assert(tdescShape[0] == laneLayout[0] &&
            "tensor descriptor shape is not distributable");
-    if (chunkSize > 1)
-      return VectorType::get({chunkSize / laneDataSize, laneDataSize},
-                             getElementType());
-    return VectorType::get({laneDataSize}, getElementType());
+    return VectorType::get({chunkSize}, getElementType());
   }

   // Case 2: block loads/stores
@@ -376,8 +375,7 @@ FailureOr<VectorType> TensorDescType::getDistributedVectorType() {
   // tensorSize must be adjusted for array_length.
   tensorSize *= getArrayLength();

-  return VectorType::get({tensorSize / (sgSize * laneDataSize), laneDataSize},
-                         getElementType());
+  return VectorType::get({tensorSize / sgSize}, getElementType());
 }

 } // namespace xegpu
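Editor's note: the updated getDistributedVectorType above boils down to one piece of arithmetic: the subgroup size is the product of lane_layout, and each work-item receives a 1D fragment of tensorSize / sgSize elements. The C++ sketch below reproduces the block (non-scattered) case; distributedShape is an illustrative standalone helper, not the dialect API:

#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Mirrors the block load/store case of getDistributedVectorType after this
// commit: every lane of the subgroup owns a 1D vector of
// tensorSize / sgSize elements.
std::vector<int64_t> distributedShape(const std::vector<int64_t> &tdescShape,
                                      const std::vector<int64_t> &laneLayout,
                                      int64_t arrayLength = 1) {
  int64_t sgSize = std::accumulate(laneLayout.begin(), laneLayout.end(),
                                   int64_t{1}, std::multiplies<int64_t>());
  int64_t tensorSize = std::accumulate(tdescShape.begin(), tdescShape.end(),
                                       int64_t{1}, std::multiplies<int64_t>());
  tensorSize *= arrayLength; // block descriptors may carry an array_length
  assert(tensorSize % sgSize == 0 && "shape is not evenly distributable");
  return {tensorSize / sgSize};
}

// Example: an 8x16 tensor descriptor with lane_layout [1, 16] yields
// distributedShape({8, 16}, {1, 16}) == {8}, i.e. vector<8xf32> per lane
// for f32 data, matching the 1D SIMT fragments described for DpasOp.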