[MLIR][XeGPU] Switch to 1D representation for SIMT code #135116

Merged · 8 commits · Apr 17, 2025
19 changes: 8 additions & 11 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -833,30 +833,27 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
data type, the matrices are `A: vector<8x16xf16>`, `B: vector<16x16xf16>`,
and `C/D: vector<8x16xf32>`. Besides the matrix size requirements, DPAS
also requires A and B to be loaded with the required data layout. Specifically,
a VNNI layout is required for the B operand. It is achieved by adding the `packed`
attribute to the `load_nd` operator. Due to the VNNI transformation, the B operand
can be represented as a 3D vector, with the last dimension representing the VNNI
factor, which is computed as `32 / bit_width_of_elem_type`. Thus, `B: vector<16x16xf16>`
can be represented as `B: vector<8x16x2xf16>`.

In SIMT mode, DpasOp expects layout attributes `a`, `b`, and `c` (only if acc is used)
which describe the data fragment owned by each work-item w.r.t. the tensor descriptor
these data are loaded from.
In SIMT code, each work-item from a subgroup holds a data fragment for A, B, C and the result,
which are represented as 1D vectors. Please refer to [OpenCL Intel extensions]
(https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html)
for more details about the fragment distribution.

Note: on PVC, the hardware can perform the load with VNNI transformation when the data
element type is 16-bit or lower precision, taking 2 or 4 elements from
the first dimension and inserting them into the newly added innermost dimension.
}];

let arguments = (ins
XeGPU_DpasOpType : $lhs,
XeGPU_DpasOpType : $rhs,
Optional<XeGPU_Vector2DType>: $acc,
OptionalAttr<XeGPU_LayoutAttr>:$a_layout,
OptionalAttr<XeGPU_LayoutAttr>:$b_layout,
OptionalAttr<XeGPU_LayoutAttr>:$c_layout);
let results = (outs XeGPU_Vector2DType: $result);
XeGPU_DpasOprType : $lhs,
XeGPU_DpasOprType : $rhs,
Optional<XeGPU_DpasResType>: $acc);
let results = (outs XeGPU_DpasResType: $result);

let extraClassDeclaration = [{
VectorType getLhsType() {
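
To make the SIMT distribution described in the DpasOp documentation above concrete, the following standalone C++ sketch (not part of this patch; the subgroup size of 16 and the 8x16x16 f16 DPAS shape are assumptions taken from the description and the linked OpenCL extension) derives the VNNI factor and the per-work-item 1D fragment sizes:

// Hypothetical standalone sketch, not code from this patch: derives the VNNI
// factor and the 1D per-work-item (SIMT) fragment sizes implied by the DpasOp
// description, assuming f16 operands and a subgroup size of 16.
#include <cstdint>
#include <iostream>

int main() {
  const int64_t sgSize = 16;                // assumed number of lanes in a subgroup
  const int64_t elemBits = 16;              // f16 operands for A and B
  const int64_t vnniFactor = 32 / elemBits; // 32 / bit_width_of_elem_type -> 2

  // SIMD-level shapes from the description:
  //   A: 8x16xf16, B: 16x16xf16 (VNNI-packed as 8x16x2xf16), C/D: 8x16xf32.
  const int64_t aFrag = 8 * 16 / sgSize;    // per-lane A fragment   -> vector<8xf16>
  const int64_t bFrag = 16 * 16 / sgSize;   // per-lane B fragment   -> vector<16xf16>
  const int64_t cFrag = 8 * 16 / sgSize;    // per-lane C/D fragment -> vector<8xf32>

  std::cout << "vnni_factor=" << vnniFactor << " A=" << aFrag
            << " B=" << bFrag << " C=" << cFrag << "\n";
  return 0;
}

Under these assumptions each work-item holds roughly vector<8xf16>, vector<16xf16>, and vector<8xf32> fragments, which is what the new rank-1 alternatives in XeGPU_DpasOprType and XeGPU_DpasResType below are meant to admit.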
3 changes: 2 additions & 1 deletion mlir/include/mlir/Dialect/XeGPU/IR/XeGPUTypes.td
@@ -17,7 +17,8 @@ def XeGPU_IntType: AnyTypeOf<[I1, I8, I16, I32, I64, SI1, SI8, SI16, SI32, SI64,
def XeGPU_FloatType: AnyTypeOf<[F16, F32, F64, BF16, TF32]>;
def XeGPU_ScalarType: AnyTypeOf<[XeGPU_IntType, XeGPU_FloatType]>;
def XeGPU_BaseAddrType: AnyTypeOf<[Non0RankedMemRefOf<[XeGPU_ScalarType]>, UI64, UI32, I64, I32]>;
def XeGPU_DpasOpType: VectorOfRankAndType<[2, 3], [XeGPU_ScalarType]>;
def XeGPU_DpasOprType: VectorOfRankAndType<[1, 2, 3], [XeGPU_ScalarType]>;
def XeGPU_DpasResType: VectorOfRankAndType<[1, 2], [XeGPU_ScalarType]>;
def XeGPU_OffsetType: VectorOfRankAndType<[1], [Index]>;
def XeGPU_MaskType: AnyTypeOf<[VectorOfRankAndType<[1], [I1]>, I1]>;
def XeGPU_ValueType: AnyTypeOf<[VectorOfRankAndType<[1,2,3,4], [XeGPU_ScalarType]>, XeGPU_ScalarType]>;
39 changes: 19 additions & 20 deletions mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
@@ -10,6 +10,7 @@
#include "mlir/IR/Builders.h"
#include "mlir/IR/DialectImplementation.h"
#include "llvm/ADT/TypeSwitch.h"
#include <numeric>

namespace mlir {
namespace xegpu {
@@ -319,49 +320,48 @@ LogicalResult TensorDescType::verify(
// ---------------------------------------------------------------------
// Case 1: Regular loads/stores.
// ---------------------------------------------------------------------
// Distributed vector shape must be:
// [chunk_size / lane_data_size, lane_data_size]
// If the tensor descriptor shape is 1D, first dimension is ignored (set to 1).
// [lane_data_size]
// The following conditions must be met:
// * tensor_desc[0] == lane_layout[0]
// Distributed vector is a 1D vector with shape:
// [chunk_size]
// ---------------------------------------------------------------------
// Case 2: Block loads/stores
// ---------------------------------------------------------------------
// Additional definitions:
// tensor_size = tensor_desc[0] * .. * tensor_desc[r-1] * array_length
// n_distribution_units = tensor_size / distribution_unit_size
// fragment_size = n_distribution_units * lane_data_size
// Given above definitions, the following conditions must be met:
// * tensor_desc[0] % (lane_layout[0] × lane_data[0]) == 0
// * tensor_desc[1] % (lane_layout[1] × lane_data[1]) == 0
// Distributed vector shape must be:
// [n_distribution_units, lane_data_size]
// Distributed vector is a 1D vector with shape:
// [fragment_size]
FailureOr<VectorType> TensorDescType::getDistributedVectorType() {
auto layout = llvm::dyn_cast_if_present<LayoutAttr>(getLayout());
// If no layout is provided, tensor desc is not used in SIMT mode.
if (!layout)
// It only works for a subgroup-level layout, which only has lane_layout
// and lane_data, and is used to distribute SIMD code into SIMT code.
if (!layout || !layout.isSgLayout())
return failure();

SmallVector<int64_t> laneData(layout.getLaneData().asArrayRef());
SmallVector<int64_t> laneLayout(layout.getLaneLayout().asArrayRef());
auto tdescShape = getShape();

auto laneDataSize = 1, sgSize = 1;
for (auto [laneDim, laneDataDim] : llvm::zip_equal(laneLayout, laneData)) {
laneDataSize *= laneDataDim;
sgSize *= laneDim;
}
// compute sgSize by multiplying the elements of laneLayout
// e.g. for a 2D layout, sgSize = laneLayout[0] * laneLayout[1]
// e.g. for a 1D layout, sgSize = laneLayout[0]
auto sgSize = std::accumulate(laneLayout.begin(), laneLayout.end(), 1,
std::multiplies<int64_t>());

// Case 1: regular loads/stores
auto scatterAttr = getEncodingAsScatterTensorDescAttr();
if (scatterAttr) {
auto chunkSize = scatterAttr.getChunkSize().getInt();
// Verify if the first dimension of the tensor descriptor shape is
// distributable.
assert(tdescShape[0] % (laneLayout[0]) == 0 &&

Review comment: Not very clear why this change

Contributor Author: I think it is a small issue after confirming with Charitha. tdescShape[0] has to be equal to laneLayout[0], such that a SIMD instruction is dispatched into a single SIMT instruction. If tdescShape[0] % laneLayout[0] == 0, it would imply that a SIMD instruction could be dispatched into multiple SIMT instructions, which is actually part of the blocking logic.

assert(tdescShape[0] == laneLayout[0] &&
"tensor descriptor shape is not distributable");
if (chunkSize > 1)
return VectorType::get({chunkSize / laneDataSize, laneDataSize},
getElementType());
return VectorType::get({laneDataSize}, getElementType());
return VectorType::get({chunkSize}, getElementType());
}

// Case 2: block loads/stores
@@ -376,8 +376,7 @@ FailureOr<VectorType> TensorDescType::getDistributedVectorType() {
// tensorSize must be adjusted for array_length.
tensorSize *= getArrayLength();

return VectorType::get({tensorSize / (sgSize * laneDataSize), laneDataSize},
getElementType());
return VectorType::get({tensorSize / sgSize}, getElementType());
}

} // namespace xegpu
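
As a rough standalone illustration of the two cases documented above (a sketch with an assumed helper name, not code from this patch), the distributed SIMT vector is always 1D after this change: [chunk_size] elements for scattered descriptors and [tensor_size / sg_size] elements for block descriptors.

// Hypothetical standalone mirror of getDistributedVectorType's shape logic,
// not code from this patch: returns the element count of the 1D per-lane
// vector, or -1 if the shape does not distribute evenly over the subgroup.
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

int64_t distributedSize(const std::vector<int64_t> &tdescShape,
                        const std::vector<int64_t> &laneLayout,
                        bool isScattered, int64_t chunkSize,
                        int64_t arrayLength) {
  // sgSize is the product of laneLayout, e.g. [1, 16] -> 16.
  int64_t sgSize = std::accumulate(laneLayout.begin(), laneLayout.end(),
                                   int64_t{1}, std::multiplies<int64_t>());
  if (isScattered) {
    // Case 1: regular (scattered) loads/stores -> vector<chunkSize x elemTy>.
    assert(tdescShape[0] == laneLayout[0] &&
           "tensor descriptor shape is not distributable");
    return chunkSize;
  }
  // Case 2: block loads/stores -> vector<(tensorSize / sgSize) x elemTy>.
  int64_t tensorSize = std::accumulate(tdescShape.begin(), tdescShape.end(),
                                       int64_t{1}, std::multiplies<int64_t>());
  tensorSize *= arrayLength; // account for array_length
  if (tensorSize % sgSize != 0)
    return -1;
  return tensorSize / sgSize;
}

For example, a block tensor_desc<8x16xf16> with lane_layout [1, 16] and array_length 1 distributes to a vector<8xf16> per lane, and a scattered descriptor with chunk_size 8 distributes to a vector<8xelemTy> per lane, matching the VectorType::get calls in the diff above.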