RFC: Add xegpu transform ops #1
I had an initial pass. I will go through it again later. It currently looks to me that we need to generalize the layout setting. Currently, the implementation is limited to support a few specific cases only. Can we run analysis inside __transform_main, and query the analysis result for each Op or Value?
mlir/include/mlir/Dialect/XeGPU/TransformOps/XeGPUTransformOps.td
what is the reason to limit the rank to be 2?
So far I have only considered 2D inputs, namely the 2D matmul op. This can be generalized as needed. My goal here was to demonstrate the transform ops with a 2D matmul and use that as the first CI test. Generalization can be added in the same PR or in a follow-up with more tests.
Why would you need such analysis? Normally, I think, it is sufficient to inspect the payload op handle and transform op arguments for, say, verification purposes.
I mean from the transform perspective: how do we systematically assign layouts to each OpResult and OpOperand in a kernel?
rename tileIndex to operandIndex; remove all references to dpas ops where possible
```tablegen
let summary = "Hoists xegpu tile descriptor ops outside the containing loop";
let description = [{
  Hoists `xegpu.create_nd_tdesc` out of the loop. If the
```
This pass may become unnecessary as we are transitioning to a new create_nd_tdesc definition: the nd_tdesc is created without an offset, and the offset moves to load_nd. create_nd_tdesc would then become loop-invariant.
Referring to these PRs:
a.1. make offset optional for create_nd_tdesc (llvm#148335)
a.2. add optional offsets for load_nd and store_nd/prefetch_nd (llvm#149424)
You may look at the Imex innersource GitHub issue #1151 for more background info.
Thanks, yes, I'm aware of this planned change. It implies some changes to the transform ops; in fact, it should make the logic simpler in most cases. Hoisting the desc ops is still needed, but indeed we might be able to use existing hoist patterns instead of an xegpu-specific method. We can address this once the new load_nd-offset pipeline is complete. In the meantime, I suggest we upstream these transform ops so that we can support linalg.matmul lowering.
```tablegen
let summary = "Adds xegpu prefetch ops to matmul operand tiles.";
let description = [{
  Given an xegpu operation residing in a `scf.for` loop, this transform
  inserts cooperative `xegpu.prefetch` operations for the A (index = 0) or
  B (index = 1) operand. The prefetch tile size is determined by the
  `sg_layout` and `sg_data` attributes.
```
Do you mean the input is a xegpu DPAS op?
Yes, the implementation only supports DPAS op at the moment.
```cpp
auto layoutAttr =
    createLayoutAttr(rewriter.getContext(), sgLayout, sgData, instData);
descOp = setDescLayout(rewriter, descOp, layoutAttr);
if (operandIndex == 2) {
```
Does the current implementation still assume the operation is a DPAS op? If so, maybe you can add a TODO note.
Upstreaming these ops is deferred due to the ongoing changes in the xegpu dialect. Closing.
XeGPU transform ops for matrix multiplication
Purpose
This document outlines new `transform.xegpu` transform operations. There is currently no support for lowering `linalg` operations to the `xegpu` dialect, although such capability would be useful in a number of user applications. The proposed XeGPU transform operations aim to fill the gaps for lowering `linalg.matmul` operations. They also address the tiling, prefetching, and other optimizations necessary to achieve good performance on Xe GPUs. Going forward, the XeGPU transform ops can be extended to support more workloads. The transform ops operate on individual payload op handles (e.g., a specific `scf.for` op), which allows defining differentiated transforms for each op (e.g., a main loop and a remainder loop after tiling).
New Operations
The new transform ops are:
- `transform.xegpu.set_operand_layout`: Given a handle to an anchor op, like `xegpu.dpas`, sets `xegpu.layout` attributes on its operands. Currently only supports DPAS ops. The DPAS op must have been tiled to the workgroup (WG) size and the reduction loop K size. This op sets the `sg_layout`, `sg_data`, and `inst_data` layout attributes.
- `transform.xegpu.insert_prefetch`: Inserts prefetch operations for an xegpu op's operands. Currently only supports the DPAS op. Sets the `sg_layout` and `sg_data` attributes, emits prefetch ops, and inserts them in the reduction loop.
- `transform.xegpu.hoist_desc_ops`: Hoists `xegpu.create_nd_tdesc` ops out of the loop.
- `transform.xegpu.set_gpu_launch_threads`: Given a handle to a `gpu.launch` op, sets the number of GPU threads. This op is a workaround to ensure the correct number of threads in the launch op.
Example: 4k matrix multiplication payload
Consider the following 4k `linalg.matmul` payload function defined with `tensor`s.
Applying existing transforms
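For reference, the payload code block was not preserved in this snapshot. A minimal sketch of such a function (assuming 4096x4096 `f16` inputs and an `f16` accumulator, matching the IR shown later) might look like:

```mlir
// Hypothetical payload sketch; the exact function is not part of this snapshot.
func.func @matmul_4k(%A: tensor<4096x4096xf16>, %B: tensor<4096x4096xf16>,
                     %C: tensor<4096x4096xf16>) -> tensor<4096x4096xf16> {
  %0 = linalg.matmul ins(%A, %B : tensor<4096x4096xf16>, tensor<4096x4096xf16>)
                     outs(%C : tensor<4096x4096xf16>) -> tensor<4096x4096xf16>
  return %0 : tensor<4096x4096xf16>
}
```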
We can apply workgroup (WG) and reduction dimension (K) tiling using upstream transform operations on the matched `linalg.matmul` op handle.
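The tiling code block was lost in this snapshot. Using upstream `transform.structured` ops, it might be expressed as follows (op choices, result ordering, and tile sizes are assumptions inferred from the shapes discussed below):

```mlir
// Sketch: WG tiling to scf.forall, then K tiling to scf.for.
%tiled, %forall = transform.structured.tile_using_forall %matmul tile_sizes [256, 256]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
%ktiled, %kloop = transform.structured.tile_using_for %tiled tile_sizes [0, 0, 32]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
```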
This produces an `scf.forall` loop for the WG tiling, followed by an `scf.for` reduction loop. The tiled matmul op has shape `(256x32, 32x256) -> 256x256`.
We can now vectorize the `linalg.matmul` op and hoist the loop-invariant C tile read/store ops. Hoisting can be safely applied as we are working on tensors, thus avoiding any memory side effects. Next, we bufferize the payload function and drop the redundant function return value.
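The transform ops for these steps were not preserved in this snapshot. With upstream transform ops, the vectorize/hoist/bufferize sequence might be sketched as (op choices are assumptions):

```mlir
// Sketch: vectorize the function body, hoist redundant transfers, bufferize.
%f = transform.structured.match ops{["func.func"]} in %root
    : (!transform.any_op) -> !transform.any_op
%vf = transform.structured.vectorize_children_and_apply_patterns %f
    : (!transform.any_op) -> !transform.any_op
%hf = transform.structured.hoist_redundant_vector_transfers %vf
    : (!transform.any_op) -> !transform.any_op
%b = transform.bufferization.one_shot_bufferize %root
    : (!transform.any_op) -> !transform.any_op
```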
The matrix multiplication is now defined with `vector` ops and `memref`s.
We can now apply existing `gpu` dialect passes to map this loop nest to GPU blocks and threads (WG and SG). We first convert the `scf.forall` loop to `scf.parallel`. The `gpu-map-parallel-loops` pass expects two `scf.parallel` loops, one for the WG and one for the SG level. At this stage, however, we only have the WG loop, so the pass assumes a single GPU thread. We will fix this later.
We can now apply the `convert-vector-to-xegpu` pass to convert the `vector` dialect ops to `xegpu` ops and fold `memref.subview` ops into the `xegpu` descriptor op. The reduction loop now reads:
```mlir
...
%2 = xegpu.create_nd_tdesc %arg2[%0, %1] : memref<4096x4096xf16>
    -> !xegpu.tensor_desc<256x256xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
%3 = xegpu.load_nd %2 : !xegpu.tensor_desc<256x256xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
    -> vector<256x256xf16>
%4 = scf.for %arg15 = %c0 to %c4096 step %c32 iter_args(%arg16 = %3) -> (vector<256x256xf16>) {
  %5 = xegpu.create_nd_tdesc %arg0[%0, %arg15] : memref<4096x4096xf16>
      -> !xegpu.tensor_desc<256x32xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
  %6 = xegpu.load_nd %5 : !xegpu.tensor_desc<256x32xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
      -> vector<256x32xf16>
  %7 = xegpu.create_nd_tdesc %arg1[%arg15, %1] : memref<4096x4096xf16>
      -> !xegpu.tensor_desc<32x256xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
  %8 = xegpu.load_nd %7 : !xegpu.tensor_desc<32x256xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
      -> vector<32x256xf16>
  %9 = xegpu.dpas %6, %8, %arg16 : vector<256x32xf16>, vector<32x256xf16>, vector<256x256xf16> -> vector<256x256xf16>
  scf.yield %9 : vector<256x256xf16>
}
xegpu.store_nd %4, %2 : vector<256x256xf16>, !xegpu.tensor_desc<256x256xf16, #xegpu.block_tdesc_attr<memory_space = global, array_length = 1 : i64, boundary_check = false>>
...
```
Applying `xegpu` transform ops
The above `xegpu` IR must be further optimized to get good performance. This is where the new `xegpu` transform ops come into play.
The `transform.xegpu.set_operand_layout` operation
The DPAS op is defined at the WG level without any indication of how it should be distributed to the subgroups. To this end, we apply the `transform.xegpu.set_operand_layout` op, which sets the `xegpu.layout` attributes. We first match the DPAS op, and then apply the desired `sg_layout`, `sg_data`, and `inst_data` attributes for the A tile (operand index = 0). The B and C tiles are handled analogously.
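The example invocations were lost in this snapshot. A hypothetical invocation for the three operands might look like the following (the assembly format, attribute names, and values shown are assumptions based on the surrounding text, not the op's confirmed syntax):

```mlir
// Hypothetical syntax; the exact format is defined by the proposed op.
%dpas = transform.structured.match ops{["xegpu.dpas"]} in %func
    : (!transform.any_op) -> !transform.any_op
transform.xegpu.set_operand_layout %dpas
    {operand_index = 0, sg_layout = [8, 4], sg_data = [32, 32], inst_data = [8, 16]}
    : !transform.any_op
// Repeat with operand_index = 1 (B tile) and operand_index = 2 (C tile),
// with the layouts appropriate for each operand.
```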
Setting the layout on the C tile also sets the `layout_result_0` attribute on the `xegpu.dpas` op. The final reduction loop then carries these layout attributes.
The `transform.xegpu.hoist_desc_ops` operation
The above IR still has the A and B descriptor ops within the reduction loop. These can be hoisted with the `transform.xegpu.hoist_desc_ops` op. The descriptor op is moved out of the loop, adding the descriptor to the loop's `iter_args` and adding an offset update op in the loop.
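The invocation was lost in this snapshot; a hypothetical form (syntax assumed), taking the reduction loop handle and returning a handle to the rewritten loop, might be:

```mlir
// Hypothetical invocation; %kloop is a handle to the scf.for reduction loop.
%new_loop = transform.xegpu.hoist_desc_ops %kloop
    : (!transform.any_op) -> !transform.any_op
```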
scf.forop and therefore the loop handle is invalidated and an another handle to the new loop is returned.The resulting IR can now lowered further using the
xegpu-wg-to-sg-distributeandxegpu-blockingpasses.The
The `transform.xegpu.insert_prefetch` operation
Cooperative prefetching can be added using the `transform.xegpu.insert_prefetch` op. The op takes a handle to the reduction loop and the DPAS op whose operands we want to prefetch. For the A tile, we prefetch the `256x32` tile using 32 threads along the first dimension, i.e. each thread fetches an `8x32` tile. This emits the descriptor, update offset, and prefetch ops in the reduction loop.
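The invocation and resulting IR were lost in this snapshot. A hypothetical invocation for the A tile might be (syntax and attribute names are assumptions; `sg_layout = [32, 1]` and `sg_data = [8, 32]` encode the "32 threads along the first dimension, each fetching an 8x32 tile" from the text, since 32*8 = 256 rows and 1*32 = 32 columns cover the 256x32 tile):

```mlir
// Hypothetical invocation for the A operand (index 0).
transform.xegpu.insert_prefetch %kloop, %dpas
    {operand_index = 0, sg_layout = [32, 1], sg_data = [8, 32]}
    : !transform.any_op, !transform.any_op
```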
The B tile prefetches are handled analogously. Here we choose to prefetch the `32x256` tile using 32 threads in an `[8, 4]` layout, each thread again fetching an `8x32` tile.
The `transform.xegpu.set_gpu_launch_threads` operation
Finally, we fix the number of threads in the `gpu.launch` op using this transform op.
Full lowering schedule
Combining the above transformations, we can now write the full lowering schedule for the matmul operation.
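The schedule itself was not preserved in this snapshot. A hypothetical sketch of how the pieces compose inside a named sequence (all op syntax for the proposed `transform.xegpu` ops is assumed) is:

```mlir
// Hypothetical end-to-end schedule skeleton; shown only to illustrate how
// the proposed ops compose with the upstream transforms described above.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // 1. WG tiling (scf.forall) and K tiling (scf.for), as described above.
    // 2. Vectorize, hoist the C tile read/store, bufferize.
    // 3. Map the loop nest to GPU blocks/threads and run convert-vector-to-xegpu.
    // 4. Set layouts for the A, B, and C operands of the DPAS op, e.g.:
    //    transform.xegpu.set_operand_layout %dpas {operand_index = 0, ...}
    // 5. Hoist descriptor ops and insert cooperative prefetches, e.g.:
    //    %loop2 = transform.xegpu.hoist_desc_ops %kloop
    //    transform.xegpu.insert_prefetch %loop2, %dpas {operand_index = 0, ...}
    // 6. Fix the launch threads, e.g.:
    //    transform.xegpu.set_gpu_launch_threads %launch {threads = [...]}
    transform.yield
  }
}
```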
The above schedule exposes the following parameters:
The output IR after the above schedule has been applied can be found here (now outdated).
Performance
The above schedule yields ~200 TFLOPS on a single PVC tile and passes the correctness test.
Discussion / Future work
- Generalize the `xegpu.set_operand_layout` and `xegpu.insert_prefetch` ops to support other ops than the `xegpu.dpas` op.
- The layout setting currently assumes the same `inst_data` tile between a load and its use. In the long term, we could have `xegpu.set_operand_layout` and `xegpu.set_result_layout` ops that set attributes for individual ops, and use the XeGPU layout propagation mechanism (under development) to handle layout conversions.
- `xegpu.set_gpu_launch_threads` should be handled differently in the future, preferably using suitable `gpu` dialect transform ops. It is included for the time being so that the IR can be executed correctly.
- For comparison, the same lowering is implemented by the pass `linalg-matmul-to-xegpu{wg-tile=256,256 sg-tile=32,64 k-tile=32 dpas-tile=8,16,16 a-prefetch=8,32 b-prefetch=8,32 a-load=32,16 b-load=32,16}`. This pass applies the same transforms to all DPAS ops.