Extend XeGPU sg_map attribute to support workgroup level semantics #1033
base: main
Conversation
save work
save work
docs/rfcs/XeGPU.md
Outdated
By allowing XeGPU to operate on workgroup-level data sizes, we provide a concise IR for a tensor compiler, instead of the multi-level nested loop IR needed for subgroup- and work-item-level operations. To enable XeGPU to operate at the workgroup level, we introduce the `wg_map` attribute to specify how the data is distributed across subgroups. `wg_map` enables a tensor compiler to express cooperative operations among subgroups by specifying a `wg_map` that partitions data among subgroups, without the IR restructuring otherwise required when using loop-nest IR. The attribute allows the tensor compiler to control the block size for both the workgroup and the subgroup and to perform autotuning, as the number of subgroups, the layout, and the tensor size per subgroup are critical performance knobs.
**Attribute xegpu.wg_map**
`wg_map` specifies how an n-d tensor (defined by the tensor descriptor) is partitioned among subgroups within a workgroup. wg_map consists of two parameters:
two->three
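(For reference while reading this thread: the attribute, as used in the create_nd_tdesc example quoted further down, takes the following form; the values are copied from that example.)

```mlir
// wg_map for a 2x2 grid of subgroups, each owning a 32x128 block of the tensor;
// sg_order = [1, 0] makes dimension 1 the fastest-varying when linearizing subgroup indices.
#wg_map_a = #xegpu.wg_map<sg_layout = [2, 2], sg_data = [32, 128], sg_order = [1, 0]>
```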
docs/rfcs/XeGPU.md
Outdated
@@ -701,6 +700,332 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor
## Workgroup level XeGPU Operations
By allowing XeGPU to operate on workgroup-level data sizes, we provide a concise IR for a tensor compiler, instead of the multi-level nested loop IR needed for subgroup- and work-item-level operations. To enable XeGPU to operate at the workgroup level, we introduce the `wg_map` attribute to specify how the data is distributed across subgroups. `wg_map` enables a tensor compiler to express cooperative operations among subgroups by specifying a `wg_map` that partitions data among subgroups, without the IR restructuring otherwise required when using loop-nest IR. The attribute allows the tensor compiler to control the block size for both the workgroup and the subgroup and to perform autotuning, as the number of subgroups, the layout, and the tensor size per subgroup are critical performance knobs.
Suggest using `lane` instead of `wi`.
Both `lane` and work item are used in the GPU dialect.
The problem with using `lane_layout` is that it reads more like describing the layout of hardware.
`wi_layout` conveys that it is about the layout of WI threads.
Can you elaborate a bit on why you like `lane` more than `wi`? @nmostafa Still struggling with the name.
docs/rfcs/XeGPU.md
Outdated
```mlir
sg_data_size = sg_data[0] × sg_data[1]
workgroup_size = sg_layout[0] × sg_layout[1]
tensor_size = tensor_desc[0] × tensor_desc[1]
```
I think using tensor_shape is better than tensor_desc.
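As a quick sanity check of the formulas quoted above, plugging in the sg_layout = [2, 2], sg_data = [32, 128] configuration used for the 128x128 example later in this review (numbers are illustrative only):

```mlir
sg_data_size   = 32 × 128  = 4096    // elements assigned to each subgroup
workgroup_size = 2 × 2     = 4       // subgroups in the workgroup
tensor_size    = 128 × 128 = 16384
```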
docs/rfcs/XeGPU.md
Outdated
For subgroup threads in a 3-d sg_layout [dim_0, dim_1, dim_2], sg_order [2, 1, 0] maps a subgroup thread with 3-d index [x, y, z] to the linear subgroup thread index [z + dim_2*y + dim_2*dim_1*x], and sg_order [1, 2, 0] maps to [y + dim_1*z + dim_1*dim_2*x].
When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor. |
The wg_map is a property of vector values (hardware registers), not in-memory layout of the tensor tile. Maybe better to just use attributes for all operations instead of a mix of tensor_desc type-attribute and op attributes.
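To make the sg_order linearization quoted above concrete, here is a small worked case with assumed extents sg_layout = [2, 4, 8] (dim_0 = 2, dim_1 = 4, dim_2 = 8) and subgroup index [x, y, z] = [1, 2, 3]:

```mlir
// sg_order = [2, 1, 0]: dimension 2 varies fastest.
linear_id = z + dim_2*y + dim_2*dim_1*x = 3 + 8*2 + 8*4*1 = 51
// sg_order = [1, 2, 0]: dimension 1 varies fastest.
linear_id = y + dim_1*z + dim_1*dim_2*x = 2 + 4*3 + 4*8*1 = 46
```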
The following conditions must hold:
* workgroup_size must represent the number of subgroups in a workgroup for a kernel. |
This is not necessarily true with warp-specialization.
docs/rfcs/XeGPU.md
Outdated
**Resulting WI Data Fragment**
The distributed tensor for each subgroup has the same dimension as the work group level tensor. |
Not sure I follow what this means.
it should be "rank" not dimension. meaning that WG -> SG distribution is non-rank reducing transformation. right?
this maybe not true for SG -> WI distribution because WI level its 1D or 2D.
docs/rfcs/XeGPU.md
Outdated
```mlir
#wg_map_a = #xegpu.wg_map<sg_layout = [2, 2], sg_data = [32, 128], sg_order = [1, 0]>
%wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<128x128xf16, #wg_map_a>
```
The table below shows the result tensor for each subgroup thread and its linear subgroup thread id. |
Avoid `subgroup thread`. Confusing.
OK. modified
docs/rfcs/XeGPU.md
Outdated
| [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 |
| [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 |
The `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attach the `wg_map` attribute to the MLIR vector data type, we attach the attribute to vector-type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or in the other direction. We describe below the propagation rules from output to input for typical operations including dpas, reduction, broadcast, shape_cast, and transpose.
This assumes a gemm kernel. We might want to consider a layout anchor operation that we can propagate from (e.g. `xegpu.assignLayout %vec, {#wg_map_a}`). This would cover both gemm and non-gemm cases, and then we can move the layout out of the tensor_desc data type.
Store is another anchor. Would that be good enough for the non-gemm use case?
What I meant is a dedicated instruction with the layout attribute being part of the op definition, so that the attribute cannot be dropped by MLIR folding passes.
What are the specific anchor operations you are looking for in the non-gemm case?
One alternative is to extend the convert_layout op, which accepts identical maps at the beginning of the propagation, to indicate the layout requirement. But it would be better to understand the use case first.
`%vector_a = xegpu.convert_layout %vector_b {#sg_map_a #sg_map_a } : vector<256x256xfloat> into vector<256x256xfloat>`
For `reduction`, the `wg_map` of the input operand has an additional dimension to represent the dimension being reduced. `sg_layout` must be the same, with the new dimension set to `1`. The new dimension of `sg_data` must equal the input tensor size along the reduced dimension, and the other dimensions must be the same as the output's `wg_map`. The new dimension of `sg_order` should not change the existing ordering specified by the output's `wg_map`.
I remember that we could introduce a new op for partial reduction. Is that a TODO, or has it been decided to go with the vector op?
With the map supporting multiple dimensions, we don't think it is a must anymore.
But for future architectures, we may introduce partial reduction to expose the hardware reduction semantics.
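As a sketch of how the reduction rule quoted above might look for a row reduction: the shapes, maps, and the use of vector.multi_reduction below are an illustration under assumed values, not text from the RFC.

```mlir
// Reduce dimension 1 of a 256x256 input down to a 256-element output (assumed shapes).
// wg_map of the output: 8 subgroups along dimension 0, each holding 32 elements.
#wg_map_out = #xegpu.wg_map<sg_layout = [8], sg_data = [32], sg_order = [0]>
// Derived wg_map of the input: the reduced dimension is added with sg_layout = 1,
// sg_data equal to the full reduced extent (256), and sg_order preserving the output's
// ordering (one valid placement of the new dimension).
#wg_map_in = #xegpu.wg_map<sg_layout = [8, 1], sg_data = [32, 256], sg_order = [1, 0]>
// The vector op that these maps would annotate during the workgroup distribution pass:
%out = vector.multi_reduction <add>, %in, %acc [1] : vector<256x256xf32> to vector<256xf32>
```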
## Appendix 1.2 Gemm with transpose, broadcast, and reduction |
Could we also have an example with input matrix in sg_order=[1,0]?
docs/rfcs/XeGPU.md
Outdated
**Extended xegpu.sg_map**

The extended `sg_map` specifies how an n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. sg_map consists of four parameters:
should be "among subgroupS" I guess
docs/rfcs/XeGPU.md
Outdated
**distribution rule**
The tensor_desc is distributed in sg_data x sg_layout chunks along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data left after all subgroups are assigned in the first round; the remaining data wraps around and is assigned starting again from the first subgroup until the data is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], the data may be used up before all subgroups are assigned. In this case, we broadcast the tensor data to multiple subgroups by repeating the data assignment to the remaining subgroups along that dimension until all subgroups get data.
Nit: rather than saying "broadcast", it is better to say this data is "shared" across all SGs in that dim.
This is following Triton's terminology.
See: https://github.com/triton-lang/triton/blob/main/include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td#L648
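Working through the distribution rule quoted above with the [128, 128] tensor, sg_layout = [2, 2], sg_data = [32, 128] case used elsewhere in this review:

```mlir
// dim 0: sg_data[0] × sg_layout[0] = 32 × 2 = 64  < 128
//   round 1 assigns rows 0:31 and 32:63; the remaining rows 64:95 and 96:127
//   wrap around and are assigned to the same subgroup rows in a second round.
// dim 1: sg_data[1] × sg_layout[1] = 128 × 2 = 256 > 128
//   the 128 columns are exhausted after one subgroup column, so the same columns
//   are assigned again to (shared with / "broadcast" to) the second subgroup column.
```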
docs/rfcs/XeGPU.md
Outdated
**Resulting WI Data Fragment**
The distributed tensor for each subgroup has the same dimension as the work group level tensor. |
it should be "rank" not dimension. meaning that WG -> SG distribution is non-rank reducing transformation. right?
this maybe not true for SG -> WI distribution because WI level its 1D or 2D.
```mlir
#sg_map_d = #xegpu.sg_map<sg_layout = [8, 4], sg_data = [32, 64], wi_layout=[1,16], wi_data = [1, 1], order=[1, 0]>
%vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}:
    vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
```
Suggested change:
`vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>` →
`vector<256x32xfloat>, vector<32x256xbf16>, vector<256x256xbf16>`
docs/rfcs/XeGPU.md
Outdated
%vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}:
    vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
    into vector<256x256xfloat>
//derived sg_map for input operands
There is still some flexibility in deciding the K dimension of sg_data for A and B here. For example, sg_data_a = [32, 16] and sg_data_b = [16, 64] should still be a valid choice, right?
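To make the flexibility raised here concrete, one possible pair of derived maps for A and B, keeping the whole K = 32 per subgroup; this is an illustration only, since the RFC's actual derived maps are truncated in the quote above:

```mlir
// Assumed derivation: sg_layout inherited from #sg_map_d, K dimension kept whole.
#sg_map_a = #xegpu.sg_map<sg_layout = [8, 4], sg_data = [32, 32], wi_layout=[1,16], wi_data = [1, 1], order=[1, 0]>
#sg_map_b = #xegpu.sg_map<sg_layout = [8, 4], sg_data = [32, 64], wi_layout=[1,16], wi_data = [1, 1], order=[1, 0]>
// Splitting K instead (e.g. sg_data = [32, 16] for A and [16, 64] for B) may also be valid, as noted above.
```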
docs/rfcs/XeGPU.md
Outdated
```mlir
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 128], wi_layout=[1,16], wi_data = [1, 1], order = [1, 0]>
%wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #sg_map_a>
```
Should be `tensor_desc<128x128xf16, #sg_map_a>`?
docs/rfcs/XeGPU.md
Outdated
The workgroup creates a tensor descriptor [128, 128] and distributes it to 4 subgroups with `sg_layout` [2, 2], and each subgroup gets `sg_data` [32, 128]. The first dimension is split and distributed to subgroups in two rounds, and the second dimension is assigned as a whole to multiple subgroups.

```mlir
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 128], wi_layout=[1,16], wi_data = [1, 1], order = [1, 0]>
```
By exposing sg_data as another degree of freedom, we allow nested wrap-around of lanes: inside a sub-group tile, and over the entire tensor.
Consider: `#sg_map_a = #xegpu.sg_map<sg_layout = [2], sg_data = [32], wi_layout=[16], wi_data = [1], order = [1, 0]>` with `tensor_desc<128xf16, #sg_map_a>`.
sg0:Lane0 will own elements 0, 16, 64, 80.
Is there a motivation for expressing such complex walks, or can we just omit sg_data and have it inferred from wi_layout * wi_data?
Yes. We use sg_data to support broadcast at the wg-to-sg level, but "evenly distribute" at the sg-to-wi level.
Say matrix A is [64, 64] at the wg level and we have 4 subgroups in a 2x2 layout; each subgroup takes [32, 64] (note the second dimension is broadcast to the 2 subgroups in a row).
The map needs to be:
`#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 64], wi_layout=[1, 16], wi_data = [1, 1], order = [1, 0]>`
but "evenly distribute" at sg to wi level
This might be confusing. Might as well just allow same broadcast/wrap-around semantics within a sub-group.
- add inst_data
- remove scope
- remove the statements saying lane_data implies a packed data unit
- change the result of WI distribution to be 1-D; packing happens in 1-D WI-level code and is not related to layout