[MLIR][XeGPU] Extend SGMapAttr and Add ConvertLayoutOp #132425

Merged: 24 commits, Apr 9, 2025
Changes from 15 commits
118 changes: 98 additions & 20 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td
@@ -35,7 +35,7 @@ def XeGPU_BlockTensorDescAttr: XeGPU_TensorDescAttr<"BlockTensorDesc", "block_td
It is default to `Global`.
2. `array_length`: It describes how many horizontally consecutive blocks
will be loaded by a hardware load instruction. If the TensorDesc shape
is 8x16, with array_length = 2. The loaded block shape will be acctually
is 8x16, with array_length = 2. The loaded block shape will be actually
8x32. Its default value is 1.
3. `boundary_check`: It is used to indicate to the hardware whether to do
out-of-boundary check. The default value is true.
@@ -154,33 +154,111 @@ def XeGPU_FenceScopeAttr:
let assemblyFormat = "$value";
}

def XeGPU_SGMapAttr : XeGPUAttr<"SGMap", "sg_map"> {
def XeGPU_LayoutAttr : XeGPUAttr<"Layout", "layout"> {
let summary = [{
Describes the mapping between work item (WI) and the 2D tensor specified by the tensor descriptor.
Describes the data distribution to subgroups and work-items for a tensor
specified by the tensor descriptor.
}];
let description = [{
To distribute the XeGPU operation to work items, the tensor_desc must be specified with the sg_map
attribute at the tensor description creation time.
Within the `sg_map`, `wi_layout` specifies the layout of work items,
describing the mapping of work items to the tensor.
wi_layout[0] x wi_layout[1] must be equal to the total number of work items within a subgroup.
`wi_data` specifies the minimum number of data elements assigned to each work item for a single distribution.

E.g., #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
In this example, the subgroup has 16 work items in wi_layout=[1, 16],
each accessing 1 element as specified by wi_data=[1, 1].

`wi_data[0] * wi_data[1]` can be greater than 1, meaning that each work item operates on multiple elements,
which is eventually lowered to "SIMT-flavor" vector, like SPIR-V vector or llvm vector, or packed to a storage data type.
The multiple elements indicated by `wi_data` can only be from one dimension and must be contiguous in the memory along either dimension.
XeGPU operations use `LayoutAttr` to define how data is distributed across subgroups and work-items.
This attribute is specified in tensor descriptors during tensor description creation. `LayoutAttr`
includes the following parameters:

* `sg_layout`: Specifies the total number of subgroups and their layout within a workgroup.
It is mandatory for workgroup-level programming and optional for subgroup programming. Its
presence implies workgroup-level code.
* `sg_data`: Defines the data size accessed per subgroup. It is optionally used with `sg_layout`
for workgroup-level programming. When it is left empty, the size accessed per subgroup can be
derived from the tensor shape and `sg_layout` using the formula:
`sg_data[i] = tensor_shape[i] / sg_layout[i]` (see the worked example after this list).
* `inst_data`: Specifies the data size processed by a single instruction. It is optionally
used with `lane_layout`. When it is left empty, the data size per instruction is equivalent
to `sg_data` for workgroup-level programming, or to the tensor shape for subgroup-level
programming.
* `lane_layout`: Specifies the total number of work-items and their arrangement within a subgroup.
It is mandatory for subgroup-level programming and optional for workgroup-level programming.
* `lane_data`: Specifies the shape of the tensor fragment that each lane accesses. It defines a single,
minimal distribution unit. Processing the entire tensor may require one or more distribution units per
hardware instruction.
* `order`: Specifies the dimension order used to linearize the n-dimensional `sg_layout` and `lane_layout`
to a 1-dimensional layout. The first dimension in the order list is the fastest-changing dimension. If it
is not present, the default value is [1, 0].

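As a worked example of the `sg_data` derivation above (hypothetical values, not part of this patch):
for a 32x64 tensor distributed with `sg_layout = [2, 4]` and `sg_data` left empty, each subgroup covers
`sg_data = [32/2, 64/4] = [16, 16]`, so the two attributes below describe the same distribution.

```mlir
// Hypothetical: sg_data omitted and derived from the 32x64 tensor shape.
#xegpu.layout<sg_layout = [2, 4], lane_layout = [2, 8], lane_data = [1, 1]>
// Equivalent explicit form for that tensor shape.
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>
```
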
### Examples:
1. Subgroup level layout:
```mlir
#xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>
```
In this example, there are 16 work-items per subgroup, organized as
[[0, 1, 2, ..., 7], [8, 9, ..., 15]]. The distribution unit is 1x1.

2. Subgroup level layout with order:
```mlir
#xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
```
In this example, there are 16 work-items per subgroup, organized as
[[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]]. The distribution unit is 1x1.

3. Workgroup level layout:
```mlir
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>
```
In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
arranged as [[0, 1, 2, 3], [4, 5, 6, 7]]. Each subgroup accesses a 16x16 block per instruction, which
is further distributed to 16 work-items organized as [[0, 1, 2, ..., 7], [8, 9, ..., 15]].

4. Workgroup level layout with order:
```mlir
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
```
In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
arranged as [[0, 2, 4, 6], [1, 3, 5, 7]]. Each subgroup accesses a 16x16 block per instruction, which
is further distributed to 16 work-items organized as [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]].
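
5. Workgroup level layout with `inst_data` (a hypothetical example added for illustration; it is not
part of the original set of examples):
```mlir
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], inst_data = [8, 16], lane_layout = [2, 8], lane_data = [1, 1]>
```
In this example, each subgroup owns a 16x16 block, but each instruction processes an 8x16 chunk of it,
so a subgroup needs two instructions to cover its block.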

}];

let parameters = (ins
ArrayRefParameter<"uint32_t">:$wi_layout,
ArrayRefParameter<"uint32_t">:$wi_data
OptionalParameter<"DenseI32ArrayAttr">: $sg_layout,
OptionalParameter<"DenseI32ArrayAttr">: $sg_data,
OptionalParameter<"DenseI32ArrayAttr">: $inst_data,
OptionalParameter<"DenseI32ArrayAttr">: $lane_layout,
OptionalParameter<"DenseI32ArrayAttr">: $lane_data,
OptionalParameter<"DenseI32ArrayAttr">: $order
);

let builders = [
AttrBuilder<(ins "llvm::ArrayRef<int>": $lane_layout,
"llvm::ArrayRef<int>": $lane_data),
[{
auto sg_layout = DenseI32ArrayAttr();
auto sg_data = DenseI32ArrayAttr();
auto inst_data = DenseI32ArrayAttr();
auto order = DenseI32ArrayAttr();
return $_get($_ctxt, sg_layout, sg_data, inst_data,
DenseI32ArrayAttr::get($_ctxt, lane_layout),
DenseI32ArrayAttr::get($_ctxt, lane_data), order);
}]>
];

let extraClassDeclaration = [{
bool isWgLayout() {
return getSgLayout() != nullptr;
Review comment (Contributor), suggested change:
return getSgLayout() != nullptr;
return getSgLayout() != DenseI32ArrayAttr();

Review comment (Contributor): I think it's more common and generic to check if the attribute is not null.

Review comment (Contributor Author): You mean you prefer getSgLayout() != nullptr, am I right?

Review comment (Contributor): Correct.

}

bool isSgLayout() {
return getSgLayout() == nullptr && getLaneLayout() != nullptr;
Review comment (Contributor), suggested change:
return getSgLayout() == nullptr && getLaneLayout() != nullptr;
return getSgLayout() == DenseI32ArrayAttr() && getLaneLayout() != DenseI32ArrayAttr();

}

let hasCustomAssemblyFormat = 1;
int64_t getRank() {
if (auto attr = getSgLayout())
return attr.size();
if (auto attr = getLaneLayout())
return attr.size();
return 0;
}
}];

let assemblyFormat = "`<` struct(params) `>`";
let genVerifyDecl = 1;
}
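
A minimal C++ usage sketch of the definition above (hypothetical and not part of this patch; it assumes
the accessors and builders that TableGen generates from these declarations and the usual XeGPU dialect
header path):

```cpp
// Hypothetical sketch: constructing and querying xegpu::LayoutAttr.
#include "mlir/Dialect/XeGPU/IR/XeGPU.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/MLIRContext.h"
#include <cassert>

using namespace mlir;

static void demoLayoutAttr(MLIRContext &ctx) {
  // Subgroup-level layout built with the lane_layout/lane_data builder above.
  auto sgLevel = xegpu::LayoutAttr::get(&ctx, /*lane_layout=*/{1, 16},
                                        /*lane_data=*/{1, 1});
  assert(sgLevel.isSgLayout() && !sgLevel.isWgLayout());
  assert(sgLevel.getRank() == 2);

  // Workgroup-level layout built from explicit DenseI32ArrayAttr parameters
  // (inst_data and order left unset, i.e. null attributes).
  auto wgLevel = xegpu::LayoutAttr::get(
      &ctx,
      DenseI32ArrayAttr::get(&ctx, {2, 4}),   // sg_layout
      DenseI32ArrayAttr::get(&ctx, {16, 16}), // sg_data
      DenseI32ArrayAttr(),                    // inst_data
      DenseI32ArrayAttr::get(&ctx, {2, 8}),   // lane_layout
      DenseI32ArrayAttr::get(&ctx, {1, 1}),   // lane_data
      DenseI32ArrayAttr());                   // order
  assert(wgLevel.isWgLayout());
  (void)sgLevel;
  (void)wgLevel;
}
```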

64 changes: 44 additions & 20 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUOps.td
@@ -80,7 +80,7 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
information e.g., memref<?x?xf16>, the strides information has to be explicitly
passed via the "strides" and "const_strides" argument.

In SIMT mode, tensor descriptor is augmented with `SGMapAttr` which describes the
In SIMT mode, tensor descriptor is augmented with `LayoutAttr` which describes the
mapping of the tensor descriptor to the work items.

Example 1 (suppose the tensor shape inferred by the compiler is 8x16):
@@ -113,7 +113,7 @@ def XeGPU_CreateNdDescOp: XeGPU_Op<"create_nd_tdesc", [Pure, ViewLikeOpInterface
%c0 = arith.constant 0 : index
%c1 = arith.constant 8 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0] : memref<1024x1024xf32>
-> !xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
-> !xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```
}];

@@ -306,7 +306,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
fp32 or fp64. It implies that vnni and transpose cannot exist at the
same time.

In SIMT mode, LoadNdOp expects the tensor descriptor to be augmented with `SGMapAttr`
In SIMT mode, LoadNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, result
vector represents the data to be loaded by each work-item.

@@ -323,7 +323,7 @@ def XeGPU_LoadNdOp : XeGPU_Op<"load_nd", [
xegpu.load_nd %1 {l1_hint = #xegpu.cache_hint<cached>,
l2_hint = #xegpu.cache_hint<uncached>}>
: !xegpu.tensor_desc<8x16xf32,
#xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>> -> vector<8x1xf32>
#xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> -> vector<8x1xf32>
```


@@ -364,7 +364,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
of cache: L1, L2 and L3. If the hardware does not have a corresponding cache,
the corresponding cache hint attribute will be masked.

In SIMT mode, StoreNdOp expects the tensor descriptor to be augmented with `SGMapAttr`
In SIMT mode, StoreNdOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, input
vector represents the data to be stored by each work-item.

@@ -381,7 +381,7 @@ def XeGPU_StoreNdOp : XeGPU_Op<"store_nd", [
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}
: vector<8x1xf16>, !xegpu.tensor_desc<8x16xf16,
#xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
#xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```


@@ -422,7 +422,7 @@ def XeGPU_UpdateNdOffsetOp : XeGPU_Op<"update_nd_offset",
Example 2 (SIMT mode):
```
%2 = xegpu.update_nd_offset %1, [0, 16]:
!xegpu.tensor_desc<8x16xf32, #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>>
!xegpu.tensor_desc<8x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
```
}];

@@ -482,7 +482,7 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
the chunk_size if the chunk size is larger than 1.

In SIMT mode, similar to `create_nd_tdesc` the resulting tensor descriptor is augmented
with `SGMapAttr` which describes the mapping of the tensor descriptor to the work items.
with `LayoutAttr` which describes the mapping of the tensor descriptor to the work items.
In this case, the first dimension of the tensor descriptor represents the work-items, and
the second dimension represents the chunk size.

@@ -517,7 +517,7 @@ def XeGPU_CreateDescOp: XeGPU_Op<"create_tdesc", [Pure, ViewLikeOpInterface]> {
%off = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
-> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>,
#xegpu.sg_map<wi_layout = [4, 1], wi_data = [1, 1]>>
#xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>
```
}];

@@ -623,7 +623,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [
The mask operand masks out memory access so that it is safe to pass out-of-boundary
addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.

In SIMT mode, LoadGatherOp expects the tensor descriptor to be augmented with `SGMapAttr`
In SIMT mode, LoadGatherOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, result vector
represents the data to be loaded by each work-item. Each work-item receives a `chunk_size`
number of elements.
Expand Down Expand Up @@ -653,7 +653,7 @@ def XeGPU_LoadGatherOp : XeGPU_Op<"load", [
l2_hint = #xegpu.cache_hint<uncached>,
l3_hint = #xegpu.cache_hint<uncached>}
: !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>,
!xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>>
#xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>>
vector<16xi1> -> vector<8x1xf32>
```

@@ -704,7 +704,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [
has transpose effect, which is similar to `load_gather`. Therefore, a transpose attribute is
introduced on purpose, making sure users are aware of this implicit transformation.

In SIMT mode, StoreScatterOp expects the tensor descriptor to be augmented with `SGMapAttr`
In SIMT mode, StoreScatterOp expects the tensor descriptor to be augmented with `LayoutAttr`
which describes the mapping of the tensor to the work items. In this case, input vector
represents the data to be stored by each work-item. Each work-item receives a `chunk_size`
number of elements.
Expand Down Expand Up @@ -732,7 +732,7 @@ def XeGPU_StoreScatterOp : XeGPU_Op<"store", [
l2_hint = #xegpu.cache_hint<write_back>,
l3_hint = #xegpu.cache_hint<write_through>}
: vector<8x1xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>,
!xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>> vector<16xi1>
#xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>> vector<16xi1>
```

}];
@@ -790,7 +790,7 @@ def XeGPU_UpdateOffsetOp: XeGPU_Op<"update_offset",
%off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
%2 = xegpu.update_offset %1, %off :
!xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>,
#xegpu.sg_map<wi_layout = [4, 1], wi_data = [1, 1]>>, vector<4xindex>
#xegpu.layout<lane_layout = [4, 1], lane_data = [1, 1]>>, vector<4xindex>
```
}];

@@ -840,9 +840,9 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
factor, which is computed as `32/bit_width_of_elem_type`. Thus, `B: vector<16x16xf16>`
can be represented as `B: vector<8x16x2xf16>`.

In SIMT mode, DpasOp expects attributes `sg_map_a`, `sg_map_b`, and `sg_map_c`
which descibes the data fragment owned by each work-item w.r.t. the tensor
descriptor these data are loaded from.
In SIMT mode, DpasOp expects the layout attributes `a_layout`, `b_layout`, and `c_layout` (the latter
only if `acc` is present), which describe the data fragment owned by each work-item w.r.t. the tensor
descriptor the data is loaded from.
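
The following is a hypothetical illustration of the SIMT-mode form described above; the shapes, layout
values, and exact printed syntax are assumptions for illustration and are not taken from this patch.
```mlir
// Hypothetical SIMT-mode dpas: each work-item holds a per-lane fragment of A, B, and C.
%d = xegpu.dpas %a, %b, %c
       {a_layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>,
        b_layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [2, 1]>,
        c_layout = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>}
       : vector<8x1xf16>, vector<8x2xf16>, vector<8x1xf32> -> vector<8x1xf32>
```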

Note: on PVC, the hardware can perform load with VNNI transformation when data
element type is 16-bit or lower precision, taking 2 or 4 elements from
@@ -853,9 +853,9 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
XeGPU_DpasOpType : $lhs,
XeGPU_DpasOpType : $rhs,
Optional<XeGPU_Vector2DType>: $acc,
OptionalAttr<XeGPU_SGMapAttr>:$sg_map_a,
OptionalAttr<XeGPU_SGMapAttr>:$sg_map_b,
OptionalAttr<XeGPU_SGMapAttr>:$sg_map_c);
OptionalAttr<XeGPU_LayoutAttr>:$a_layout,
OptionalAttr<XeGPU_LayoutAttr>:$b_layout,
OptionalAttr<XeGPU_LayoutAttr>:$c_layout);
let results = (outs XeGPU_Vector2DType: $result);

let extraClassDeclaration = [{
@@ -876,6 +876,10 @@ def XeGPU_DpasOp : XeGPU_Op<"dpas", [Pure, AllElementTypesMatch<["lhs", "rhs"]>]
VectorType getResultType() {
return getResult().getType();
}

bool hasAcc() {
return getAcc() != nullptr;
}
}];

let assemblyFormat = [{
@@ -979,4 +983,24 @@ def XeGPU_FenceOp: XeGPU_Op<"fence", []> {
let extraClassDeclaration = extraBaseClassDeclaration;
}

def XeGPU_ConvertLayoutOp: XeGPU_Op<"convert_layout", [Pure, AllTypesMatch<["source", "result"]>]> {
Review comment (Contributor): Is this going to be lowered to some sequence of real instructions? Also, why is it not supported for the wi-level layout?

Review comment (Contributor Author): Yes. It essentially performs data sharing/shuffling among subgroups and lanes and will be lowered mostly to SLM accesses. It is not meaningful for the wi-level layout.

let summary = "Convert the layout of the input operand";
let description = [{
`convert_layout` adjusts the data distribution across subgroups and/or work-items by modifying
the `LayoutAttr`. Both `srcMap` and `resMap` must correspond to the same programming scope, such
as workgroup-level (wg) or subgroup-level (sg) code. This operation is not valid once the IR is
lowered to WI level because that is the end result of all distributions.
}];
let arguments = (ins XeGPU_Vector2DType: $source,
XeGPU_LayoutAttr: $srcMap,
XeGPU_LayoutAttr: $resMap
);
let results = (outs XeGPU_Vector2DType: $result);
let assemblyFormat = [{
$source attr-dict `:` type($source)
}];

let hasVerifier = 1;
}
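
Below is a hypothetical workgroup-level example following the assembly format declared above; the shapes
and layout values are chosen for illustration and are not taken from this patch.

```mlir
// Hypothetical: redistribute a 32x64 tile from a [2, 4] to a [4, 2] subgroup arrangement.
%r = xegpu.convert_layout %src
       {srcMap = #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>,
        resMap = #xegpu.layout<sg_layout = [4, 2], sg_data = [8, 32], lane_layout = [2, 8], lane_data = [1, 1]>}
       : vector<32x64xf32>
```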

#endif // MLIR_DIALECT_XEGPU_IR_XEGPUOPS_TD