
Extend XeGPU sg_map attribute to support workgroup level semantics #1033


Open
wants to merge 12 commits into base: main

Conversation

Jianhui-Li
Contributor

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

Allowing XeGPU to operate on workgroup-level data sizes gives a tensor compiler a concise IR, instead of the multi-level nested loop IR otherwise needed for subgroup- and work-item-level operations. To enable XeGPU to operate at the workgroup level, we introduce the `wg_map` attribute, which specifies how the data is distributed across subgroups. `wg_map` lets the tensor compiler express cooperative operations among subgroups by specifying a `wg_map` that partitions data among them, without the IR restructuring otherwise required when using a loop-nest IR. The attribute also lets the tensor compiler control the block size for both the workgroup and the subgroup and perform autotuning, since the number of subgroups, their layout, and the tensor size per subgroup are critical performance knobs.

**Attribute xegpu.wg_map**
`wg_map` specifies how an n-d tensor (defined by the tensor descriptor) is partitioned among subgroups within a workgroup. wg_map consists of two parameters:
Contributor

two->three

@@ -701,6 +700,332 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor

```

## Workgroup level XeGPU Operations

Allowing XeGPU to operate on workgroup-level data sizes gives a tensor compiler a concise IR, instead of the multi-level nested loop IR otherwise needed for subgroup- and work-item-level operations. To enable XeGPU to operate at the workgroup level, we introduce the `wg_map` attribute, which specifies how the data is distributed across subgroups. `wg_map` lets the tensor compiler express cooperative operations among subgroups by specifying a `wg_map` that partitions data among them, without the IR restructuring otherwise required when using a loop-nest IR. The attribute also lets the tensor compiler control the block size for both the workgroup and the subgroup and perform autotuning, since the number of subgroups, their layout, and the tensor size per subgroup are critical performance knobs.
Contributor

suggest using lane instead of wi.

Contributor Author

Both lane and work item are used in the GPU dialect.
The problem with "lane_layout" is that it reads more like describing the layout of the hardware.
"wi_layout" conveys that it is about the layout of WI threads.

Contributor Author

Can you elaborate a bit on why you like "lane" more than "wi"? @nmostafa Still struggling with the name.

```mlir
sg_data_size = sg_data[0] × sg_data[1]
workgroup_size = sg_layout[0] × sg_layout[1]
tensor_size = tensor_desc[0] × tensor_desc[1]
Contributor

I think using tensor_shape is better than tensor_desc.


For subgroup threads in a 3-d sg_layout [dim_0, dim_1, dim_2], sg_order [2, 1, 0] maps a subgroup thread with 3-d index [x, y, z] to the linear subgroup thread index [z + dim_2*y + dim_2*dim_1*x], while sg_order [1, 2, 0] maps it to [y + dim_1*z + dim_1*dim_2*x].
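As a quick illustration, here is a minimal Python sketch of this ordering (the helper name is ours, not part of the dialect), assuming sg_order lists the layout dimensions from fastest- to slowest-varying, as in the [2, 1, 0] case above:

```python
# Minimal sketch: linearize a subgroup index under a given sg_order, where
# sg_order lists the layout dimensions from fastest- to slowest-varying.
def linear_sg_id(idx, sg_layout, sg_order):
    stride, linear = 1, 0
    for d in sg_order:
        linear += idx[d] * stride
        stride *= sg_layout[d]
    return linear

# sg_order = [2, 1, 0] on sg_layout [dim_0, dim_1, dim_2] = [4, 5, 6] reproduces
# z + dim_2*y + dim_2*dim_1*x for the subgroup index [x, y, z] = [1, 2, 3].
assert linear_sg_id([1, 2, 3], [4, 5, 6], [2, 1, 0]) == 3 + 6 * 2 + 6 * 5 * 1
```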

When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor.
Contributor

The wg_map is a property of vector values (hardware registers), not in-memory layout of the tensor tile. Maybe better to just use attributes for all operations instead of a mix of tensor_desc type-attribute and op attributes.


The following conditions must hold:

* workgroup_size must represent the number of subgroups in a workgroup for a kernel.
Contributor

This is not necessarily true with warp-specialization.


**Resulting WI Data Fragment**

The distributed tensor for each subgroup has the same dimension as the work group level tensor.
Contributor
@nmostafa nmostafa Mar 12, 2025

Not sure I follow what this means.

Contributor

It should be "rank", not dimension, meaning that WG -> SG distribution is a non-rank-reducing transformation, right?
This may not be true for SG -> WI distribution, because the WI level is 1D or 2D.

#wg_map_a = #xegpu.wg_map<sg_layout = [2, 2], sg_data = [32, 128], sg_order = [1, 0]>
%wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<128x128xf16, #wg_map_a>
```
The table below shows the result tensor for each subgroup thread and its linear subgroup thread id.
Contributor

avoid subgroup thread. Confusing.

Contributor Author

OK. modified

| [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 |
| [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 |

The `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attach the `wg_map` attribute to the MLIR vector data type, we attach the attribute to vector-type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or in the other direction. Below we describe the propagation rules from output to input for typical operations, including dpas, reduction, broadcast, shape_cast, and transpose.
Contributor

This assumes a gemm kernel. We might want to consider a layout anchor operation that we can propagate from (e.g. xegpu.assignLayout %vec, {#wg_map_a}). This would cover both gemm and non-gemm cases, and then we can remove the layout from the tensor_desc data type.

Contributor Author

Store is another anchor. Would that be good enough for the non-gemm use case?

Contributor

What I meant is a dedicated instruction with the layout attribute as part of the op definition, so the attribute cannot be dropped by MLIR folding passes.

Contributor Author

What are the specific anchor operations for the non-gemm case you are looking for?
One alternative is to extend the convert_layout op, which accepts identical maps at the beginning of the propagation, to indicate the layout requirement. But it would be better to understand the use case first.
%vector_a = xegpu.convert_layout %vector_b {#sg_map_a #sg_map_a }: vector<256x256xfloat> into vector<256x256xfloat>

```

For `reduction`, the `wg_map` of the input operand has an additional dimension representing the dimension being reduced. `sg_layout` must be the same as the output's, with the new dimension set to `1`. The new dimension of `sg_data` must equal the input tensor size along that dimension, and the other dimensions must match the output's `wg_map`. The new dimension in `sg_order` must not change the existing ordering specified by the output's `wg_map`.
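As an illustration of this rule, here is a minimal Python sketch (the helper name and the shapes in the example are ours, not an xegpu API); appending the reduced dimension at the end of `sg_order` is one way to keep the existing relative order:

```python
# Sketch only: derive the reduction input's wg_map from the output's wg_map.
def derive_reduction_input_map(out_map, reduce_dim, input_shape):
    sg_layout = list(out_map["sg_layout"])
    sg_layout.insert(reduce_dim, 1)                      # new dimension has layout size 1
    sg_data = list(out_map["sg_data"])
    sg_data.insert(reduce_dim, input_shape[reduce_dim])  # new dimension covers the full reduced extent
    # Renumber the existing order for the inserted dimension and append the new
    # dimension last, so the relative order of the original dimensions is unchanged.
    sg_order = [d + 1 if d >= reduce_dim else d for d in out_map["sg_order"]]
    sg_order.append(reduce_dim)
    return {"sg_layout": sg_layout, "sg_data": sg_data, "sg_order": sg_order}

# Reducing dim 1 of a 32x128 input whose output has sg_layout = [8], sg_data = [4]:
print(derive_reduction_input_map(
    {"sg_layout": [8], "sg_data": [4], "sg_order": [0]}, reduce_dim=1, input_shape=(32, 128)))
# -> {'sg_layout': [8, 1], 'sg_data': [4, 128], 'sg_order': [0, 1]}
```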


I remember that we could introduce a new op for partial reduction. Is it a TODO or decided to go with vector op?

Contributor Author
@Jianhui-Li Jianhui-Li Mar 13, 2025

With the map supporting multiple dimensions, we don't think it is a must anymore.
But for future architectures, we may introduce partial reduction to expose the hardware reduction semantics.

}
}
```
## Appendix 1.2 Gemm with transpose, broadcast, and reduction


Could we also have an example with input matrix in sg_order=[1,0]?

@Jianhui-Li Jianhui-Li changed the title Extend XeGPU dialect with wg_map attribute Extend XeGPU sg_map attribute to include sg_layout and order Mar 13, 2025
@Jianhui-Li Jianhui-Li changed the title Extend XeGPU sg_map attribute to include sg_layout and order Extend XeGPU sg_map attribute to support workgroup level semantics Mar 13, 2025

**Extended xegpu.sg_map**

The extended `sg_map` specifies how an n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. sg_map consists of four parameters:
Contributor

should be "among subgroupS" I guess


**Distribution Rule**

The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data left over after all subgroups have been assigned in the first round; the remaining data wraps around and is assigned starting again from the first subgroup until it is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], the data may be used up before all subgroups are assigned. In this case, we broadcast the tensor data to multiple subgroups by repeating the data assignment to the remaining subgroups along that dimension until all subgroups have data.
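The sketch below illustrates this per-dimension rule in Python (the helper name is ours, not an xegpu API; it assumes sg_data[i] evenly divides tensor_desc[i]) and covers both the wrap-around and the shared case:

```python
def sg_chunks_1d(sg_id, sg_layout_i, sg_data_i, tensor_dim_i):
    """Return which sg_data_i-sized chunks along dimension i subgroup sg_id gets."""
    num_chunks = tensor_dim_i // sg_data_i
    if num_chunks >= sg_layout_i:
        # Wrap-around case: chunks are dealt out round-robin over the sg_layout.
        return [c for c in range(num_chunks) if c % sg_layout_i == sg_id]
    # Fewer chunks than subgroups: the assignment repeats, so several subgroups
    # along this dimension get the same (broadcast/shared) chunk.
    return [sg_id % num_chunks]

# Dim 0 of the earlier 128x128 example (sg_layout 2, sg_data 32): subgroup row 0
# owns chunks [0, 2], i.e. rows 0:31 and 64:95.
print(sg_chunks_1d(0, 2, 32, 128))   # [0, 2]
# Dim 1 (sg_layout 2, sg_data 128): both subgroup columns share chunk 0.
print(sg_chunks_1d(1, 2, 128, 128))  # [0]
```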
Contributor

nit: rather than saying "broadcast", it is better to say this data is "shared" across all SGs in that dim.



**Resulting WI Data Fragment**

The distributed tensor for each subgroup has the same dimension as the work group level tensor.
Contributor

It should be "rank", not dimension, meaning that WG -> SG distribution is a non-rank-reducing transformation, right?
This may not be true for SG -> WI distribution, because the WI level is 1D or 2D.

```mlir
#sg_map_d = #xegpu.sg_map<sg_layout = [8, 4], sg_data = [32, 64], wi_layout=[1,16], wi_data = [1, 1], order=[1, 0]>
%vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}:
vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
Contributor

Suggested change
vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
vector<256x32xfloat>, vector<32x256xbf16>, vector<256x256xbf16>

%vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}:
vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
into vector<256x256xfloat>
//derived sg_map for input operands
Contributor

There is still some flexibility in deciding the K dimension for sg_data in A and B here. For example, sg_data_a = [32, 16] and sg_data_b = [16, 64] should still be a valid choice, right?


```mlir
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 128], wi_layout=[1,16], wi_data = [1, 1], order = [1, 0]>
%wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #sg_map_a>
Contributor

Should be tensor_desc<128x128xf16, #sg_map_a> ?

The workgroup creates a tensor descriptor [128, 128] and distributes it to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to the subgroups in two rounds, and the second dimension is assigned as a whole to multiple subgroups.

```mlir
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 128], wi_layout=[1,16], wi_data = [1, 1], order = [1, 0]>
Contributor

By exposing sg_data as another degree of freedom, we allow nested wrap-around of lanes: inside a sub-group tile, and over the entire tensor.
Consider:
#sg_map_a = #xegpu.sg_map<sg_layout = [2], sg_data = [32], wi_layout=[16], wi_data = [1], order = [1, 0]> tensor_desc<128xf16, #sg_map_a>

sg0:Lane0 will own elements 0, 16, 64, 80

Is there a motivation for expressing such complex walks, or can we just omit sg_data and have it inferred from wi_layout * wi_data ?

Contributor Author

Yes. We use sg_data to support broadcast at the wg-to-sg level, but "evenly distribute" at the sg-to-wi level.
Say Matrix A is [64, 64] at the wg level and we have 4 sgs in a 2x2 layout; each sg takes [32, 64] (note the second dimension is broadcast to the 2 subgroups in a row).
The map needs to be:
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 64], wi_layout=[1, 16], wi_data = [1, 1], order = [1, 0]>

Contributor
@nmostafa nmostafa Mar 20, 2025

but "evenly distribute" at sg to wi level

This might be confusing. Might as well just allow same broadcast/wrap-around semantics within a sub-group.

add inst_data
remove scope
remove the statements about lane_data implying a packed data unit
change the result of WI distribution to be 1D; packing happens in 1D WI-level code, not related to layout