
Extend XeGPU sg_map attribute to support workgroup level semantics #1033


Open
wants to merge 12 commits into base: main

Conversation

Jianhui-Li
Contributor

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

Allowing XeGPU to operate on workgroup-level data sizes gives a tensor compiler a concise IR, instead of the multi-level nested loop IR otherwise needed for subgroup- and work-item-level operations. To enable XeGPU to operate at the workgroup level, we introduce the `wg_map` attribute, which specifies how the data is distributed across subgroups. `wg_map` lets the tensor compiler express cooperative operations among subgroups by specifying a `wg_map` that partitions data among them, without the IR restructuring otherwise required when using a loop-nest IR. The attribute also lets the tensor compiler control the block size for both the workgroup and the subgroup and perform autotuning, since the number of subgroups, their layout, and the tensor size per subgroup are critical performance knobs.

**Attribute xegpu.wg_map**
`wg_map` specifies how an n-d tensor (defined by the tensor descriptor) is partitioned among subgroups within a workgroup. wg_map consists of two parameters:
Contributor

two->three

@@ -701,6 +700,332 @@ An example on how to perform transpose using load with chunk_size in SIMT flavor

```

## Workgroup level XeGPU Operations

Allowing XeGPU to operate on workgroup-level data sizes gives a tensor compiler a concise IR, instead of the multi-level nested loop IR otherwise needed for subgroup- and work-item-level operations. To enable XeGPU to operate at the workgroup level, we introduce the `wg_map` attribute, which specifies how the data is distributed across subgroups. `wg_map` lets the tensor compiler express cooperative operations among subgroups by specifying a `wg_map` that partitions data among them, without the IR restructuring otherwise required when using a loop-nest IR. The attribute also lets the tensor compiler control the block size for both the workgroup and the subgroup and perform autotuning, since the number of subgroups, their layout, and the tensor size per subgroup are critical performance knobs.
Contributor

suggest using lane instead of wi.

Contributor Author

Both lane and work item are used in the GPU dialect.
The problem with "lane_layout" is that it reads more like describing the layout of the hardware.
"wi_layout" conveys that it is about the layout of WI threads.

Contributor Author

Can you elaborate a bit on why you like "lane" more than "wi"? @nmostafa Still struggling with the name.

```mlir
sg_data_size = sg_data[0] × sg_data[1]
workgroup_size = sg_layout[0] × sg_layout[1]
tensor_size = tensor_desc[0] × tensor_desc[1]
Contributor

I think using tensor_shape is better than tensor_desc.


For subgroup threads in a 3-d sg_layout [dim_0, dim_1, dim_2], sg_order [2, 1, 0] maps a subgroup thread with 3-d index [x, y, z] to the linear subgroup thread index [z + dim_2*y + dim_2*dim_1*x], while sg_order [1, 2, 0] maps it to [y + dim_1*z + dim_1*dim_2*x].
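As a quick illustration, here is a minimal Python sketch of this ordering (the helper name is ours, not part of the dialect), assuming sg_order lists the layout dimensions from fastest- to slowest-varying, as in the [2, 1, 0] case above:

```python
# Minimal sketch: linearize a subgroup index under a given sg_order, where
# sg_order lists the layout dimensions from fastest- to slowest-varying.
def linear_sg_id(idx, sg_layout, sg_order):
    stride, linear = 1, 0
    for d in sg_order:
        linear += idx[d] * stride
        stride *= sg_layout[d]
    return linear

# sg_order = [2, 1, 0] on sg_layout [dim_0, dim_1, dim_2] = [4, 5, 6] reproduces
# z + dim_2*y + dim_2*dim_1*x for the subgroup index [x, y, z] = [1, 2, 3].
assert linear_sg_id([1, 2, 3], [4, 5, 6], [2, 1, 0]) == 3 + 6 * 2 + 6 * 5 * 1
```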

When a wg_map attribute is attached to a tensor descriptor, load/store/dpas will operate at the workgroup level. The wg_map attribute must be specified when creating the tensor descriptor.
Contributor

The wg_map is a property of vector values (hardware registers), not in-memory layout of the tensor tile. Maybe better to just use attributes for all operations instead of a mix of tensor_desc type-attribute and op attributes.


The following conditions must hold:

* workgroup_size must represent the number of subgroups in a workgroup for a kernel.
Contributor

This is not necessarily true with warp-specialization.


**Resulting WI Data Fragment**

The distributed tensor for each subgroup has the same dimension as the work group level tensor.
Contributor
@nmostafa nmostafa Mar 12, 2025

Not sure I follow what this means.

Contributor

It should be "rank", not dimension, meaning that WG -> SG distribution is a non-rank-reducing transformation, right?
This may not be true for SG -> WI distribution, because the WI level is 1D or 2D.

#wg_map_a = #xegpu.wg_map<sg_layout = [2, 2], sg_data = [32, 128], sg_order = [1, 0]>
%wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<128x128xf16, #wg_map_a>
```
The table below shows the result tensor for each subgroup thread and its linear subgroup thread id.
Contributor

avoid subgroup thread. Confusing.

Contributor Author

OK. modified

| [ 64:95, 0:127] | [0, 0], [0, 1] | 0 , 1 |
| [ 96:127, 0:127] | [1, 0], [1, 1] | 2 , 3 |

The `wg_map` attribute propagates from the matrix multiplication ops to other ops. Since we can't attach the `wg_map` attribute to the MLIR vector data type, we attach the attribute to vector-type-based operations temporarily within the workgroup distribution pass. The `wg_map` attribute propagation can be performed from output to input, or in the other direction. Below we describe the propagation rules from output to input for typical operations, including dpas, reduction, broadcast, shape_cast, and transpose.
Contributor

This assumes a gemm kernel. We might want to consider a layout anchor operation that we can propagate from (e.g. xegpu.assignLayout %vec, {#wg_map_a}). This would cover both gemm and non-gemm cases, and then we can remove the layout from the tensor_desc data type.

Contributor Author

Store is another anchor. Would that be good enough for the non-gemm use case?

Contributor

What I meant is a dedicated instruction with the layout attribute as part of the op definition, so the attribute cannot be dropped by MLIR folding passes.

Contributor Author

What are the specific anchor operations for the non-gemm case you are looking for?
One alternative is to extend the convert_layout op, which accepts identical maps at the beginning of the propagation, to indicate the layout requirement. But it would be better to understand the use case first.
%vector_a = xegpu.convert_layout %vector_b {#sg_map_a #sg_map_a }: vector<256x256xfloat> into vector<256x256xfloat>

```

For `reduction`, the `wg_map` of the input operand has an additional dimension representing the dimension being reduced. `sg_layout` must be the same as the output's, with the new dimension set to `1`. The new dimension of `sg_data` must equal the input tensor size along that dimension, and the other dimensions must match the output's `wg_map`. The new dimension in `sg_order` must not change the existing ordering specified by the output's `wg_map`.
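As an illustration of this rule, here is a minimal Python sketch (the helper name and the shapes in the example are ours, not an xegpu API); appending the reduced dimension at the end of `sg_order` is one way to keep the existing relative order:

```python
# Sketch only: derive the reduction input's wg_map from the output's wg_map.
def derive_reduction_input_map(out_map, reduce_dim, input_shape):
    sg_layout = list(out_map["sg_layout"])
    sg_layout.insert(reduce_dim, 1)                      # new dimension has layout size 1
    sg_data = list(out_map["sg_data"])
    sg_data.insert(reduce_dim, input_shape[reduce_dim])  # new dimension covers the full reduced extent
    # Renumber the existing order for the inserted dimension and append the new
    # dimension last, so the relative order of the original dimensions is unchanged.
    sg_order = [d + 1 if d >= reduce_dim else d for d in out_map["sg_order"]]
    sg_order.append(reduce_dim)
    return {"sg_layout": sg_layout, "sg_data": sg_data, "sg_order": sg_order}

# Reducing dim 1 of a 32x128 input whose output has sg_layout = [8], sg_data = [4]:
print(derive_reduction_input_map(
    {"sg_layout": [8], "sg_data": [4], "sg_order": [0]}, reduce_dim=1, input_shape=(32, 128)))
# -> {'sg_layout': [8, 1], 'sg_data': [4, 128], 'sg_order': [0, 1]}
```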


I remember that we could introduce a new op for partial reduction. Is it a TODO or decided to go with vector op?

Contributor Author
@Jianhui-Li Jianhui-Li Mar 13, 2025

With the map supporting multiple dimensions, we don't think it is a must anymore.
But for future architectures, we may introduce partial reduction to expose the hardware reduction semantics.

}
}
```
## Appendix 1.2 Gemm with transpose, broadcast, and reduction


Could we also have an example with input matrix in sg_order=[1,0]?

@Jianhui-Li Jianhui-Li changed the title Extend XeGPU dialect with wg_map attribute Extend XeGPU sg_map attribute to include sg_layout and order Mar 13, 2025
@Jianhui-Li Jianhui-Li changed the title Extend XeGPU sg_map attribute to include sg_layout and order Extend XeGPU sg_map attribute to support workgroup level semantics Mar 13, 2025

**Extended xegpu.sg_map**

The extended `sg_map` specifies how an n-d tensor (defined by the tensor descriptor) is partitioned among subgroup within a workgroup. sg_map consists of four parameters:
Contributor

should be "among subgroupS" I guess


**Distribution Rule**

The tensor_desc is distributed to sg_data x sg_layout along each dimension in a round-robin fashion. If sg_data[i] x sg_layout[i] < tensor_desc[i], there is data left over after all subgroups have been assigned in the first round; the remaining data wraps around and is assigned starting again from the first subgroup until it is completely assigned. If sg_data[i] x sg_layout[i] > tensor_desc[i], the data may be used up before all subgroups are assigned. In this case, we broadcast the tensor data to multiple subgroups by repeating the data assignment to the remaining subgroups along that dimension until all subgroups have data.
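The sketch below illustrates this per-dimension rule in Python (the helper name is ours, not an xegpu API; it assumes sg_data[i] evenly divides tensor_desc[i]) and covers both the wrap-around and the shared case:

```python
def sg_chunks_1d(sg_id, sg_layout_i, sg_data_i, tensor_dim_i):
    """Return which sg_data_i-sized chunks along dimension i subgroup sg_id gets."""
    num_chunks = tensor_dim_i // sg_data_i
    if num_chunks >= sg_layout_i:
        # Wrap-around case: chunks are dealt out round-robin over the sg_layout.
        return [c for c in range(num_chunks) if c % sg_layout_i == sg_id]
    # Fewer chunks than subgroups: the assignment repeats, so several subgroups
    # along this dimension get the same (broadcast/shared) chunk.
    return [sg_id % num_chunks]

# Dim 0 of the earlier 128x128 example (sg_layout 2, sg_data 32): subgroup row 0
# owns chunks [0, 2], i.e. rows 0:31 and 64:95.
print(sg_chunks_1d(0, 2, 32, 128))   # [0, 2]
# Dim 1 (sg_layout 2, sg_data 128): both subgroup columns share chunk 0.
print(sg_chunks_1d(1, 2, 128, 128))  # [0]
```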
Contributor

nit: rather than saying "broadcast", it is better to say this data is "shared" across all SGs in that dim.



**Resulting WI Data Fragment**

The distributed tensor for each subgroup has the same dimension as the work group level tensor.
Contributor

It should be "rank", not dimension, meaning that WG -> SG distribution is a non-rank-reducing transformation, right?
This may not be true for SG -> WI distribution, because the WI level is 1D or 2D.

```mlir
#sg_map_d = #xegpu.sg_map<sg_layout = [8, 4], sg_data = [32, 64], wi_layout=[1,16], wi_data = [1, 1], order=[1, 0]>
%vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}:
vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
Contributor

Suggested change
vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
vector<256x32xfloat>, vector<32x256xbf16>, vector<256x256xbf16>

%vector_d = xegpu.dpas %vector_a, %vector_b, %vector_c {#sg_map_d}:
vector<256x256xfloat>, vector<256x32xbf16>, vector<32x256xbf16>
into vector<256x256xfloat>
//derived sg_map for input operands
Contributor

There is still some flexibility in deciding the K dimension for sg_data in A and B here. For example, sg_data_a = [32, 16] and sg_data_b = [16, 64] should still be a valid choice, right?


```mlir
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 128], wi_layout=[1,16], wi_data = [1, 1], order = [1, 0]>
%wg_tdesc = xegpu.create_nd_tdesc %A[%m, %c0] : memref<1024x1024xf16> -> tensor_desc<16x128xf16, #sg_map_a>
Contributor

Should be tensor_desc<128x128xf16, #sg_map_a> ?

The workgroup creates a tensor descriptor [128, 128] and distributes it to 4 subgroups with `sg_layout` [2,2], and each subgroup gets `sg_data` [32,128]. The first dimension is split and distributed to the subgroups in two rounds, and the second dimension is assigned as a whole to multiple subgroups.

```mlir
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 128], wi_layout=[1,16], wi_data = [1, 1], order = [1, 0]>
Contributor

By exposing sg_data as another degree of freedom, we allow nested wrap-around of lanes: inside a sub-group tile, and over the entire tensor.
Consider:
#sg_map_a = #xegpu.sg_map<sg_layout = [2], sg_data = [32], wi_layout=[16], wi_data = [1], order = [1, 0]> tensor_desc<128xf16, #sg_map_a>

sg0:Lane0 will own elements 0, 16, 64, 80

Is there a motivation for expressing such complex walks, or can we just omit sg_data and have it inferred from wi_layout * wi_data ?

Contributor Author

Yes. We use sg_data to support broadcast at the wg-to-sg level, but "evenly distribute" at the sg-to-wi level.
Say Matrix A is [64, 64] at the wg level and we have 4 sgs in a 2x2 layout; each sg takes [32, 64] (note the second dimension is broadcast to the 2 subgroups in a row).
The map needs to be:
#sg_map_a = #xegpu.sg_map<sg_layout = [2, 2], sg_data = [32, 64], wi_layout=[1, 16], wi_data = [1, 1], order = [1, 0]>

Contributor
@nmostafa nmostafa Mar 20, 2025

but "evenly distribute" at sg to wi level

This might be confusing. Might as well just allow same broadcast/wrap-around semantics within a sub-group.

add inst_data
remove scope
remove the statements about lane_data implying a packed data unit
change the result of WI distribution to be 1D; packing happens in 1D WI-level code, not related to layout