
Commit 3934bc8

Add gpu pipeline living doc (#164)
* Add GPU pipeline overview
* Add integration section
* Add the compilation-related decisions section
* Add GPU pipeline outlook from the kernel lowering perspective
1 parent c4a8593 commit 3934bc8

File tree

1 file changed: +72 -0 lines changed


doc/GPUPipeline.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# GPU Pipeline overview

This is a living document for the GPU pipeline design. Its purpose is to keep the decision history and provide a guiding overview for development. We expect swift changes in the design as we go, so this document mostly highlights guiding principles.

## Initial state description

The primary goal of the design is to ensure certain qualities in the final solution.
The spirit of the design is to reuse existing parts, prefer upstream solutions, and target long-term support in conjunction with other devices.

At the highest level, the pipeline can be split into three main stages:
1. High-level platform-independent* transformations. These are to be shared with other flows (e.g., fusion).
2. GPU-specific transformations. These are responsible for HW mapping and include everything until SPIR-V is emitted.
3. Code generation. This is tailored to a particular platform and is performed by a backend.

There are existing paths for each stage (sometimes multiple; the choice affects other parts). A short landscape description follows.

### Landscape
There are two primary ways of generating GPU target binary code, both going through IGC: the scalar and vector paths.

The scalar (aka SIMT) path relies on IGC's vectorization capabilities to map logical threads to SIMD lanes. Handling synchronization (e.g., cross-lane communication) is the main burden for an otherwise transformation-amenable representation.
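
For illustration, a minimal SIMT-style sketch in the upstream `gpu` dialect (kernel name, sizes, and body are hypothetical, not taken from the actual pipeline): each logical thread computes a single element, and mapping threads to SIMD lanes is left to IGC.

```
// Hypothetical SIMT-style kernel: one logical thread per element.
gpu.module @kernels {
  gpu.func @add(%a: memref<1024xf32>, %b: memref<1024xf32>, %c: memref<1024xf32>) kernel {
    // Compute the global element index for this logical thread.
    %bid  = gpu.block_id x
    %bdim = gpu.block_dim x
    %tid  = gpu.thread_id x
    %base = arith.muli %bid, %bdim : index
    %idx  = arith.addi %base, %tid : index
    // Scalar per-thread computation; vectorization across lanes is IGC's job.
    %x = memref.load %a[%idx] : memref<1024xf32>
    %y = memref.load %b[%idx] : memref<1024xf32>
    %sum = arith.addf %x, %y : f32
    memref.store %sum, %c[%idx] : memref<1024xf32>
    gpu.return
  }
}
```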

The vector (aka SIMD) path in IGC expects the IR to have a certain explicitly-vectorized form, primarily built via a set of intrinsics (VC-intrinsics). The main complexity of this approach for the pipeline is handling data/compute distribution across those vectors and handling such a deviation from the lowering paths used for other GPU types.

Today, there are two main options to reach the low-level compiler:
1. Lower to the SPIR-V dialect and serialize it (IMEX).
2. Lower to LLVM IR and use the SPIR-V Translator (Triton).

Both produce SPIR-V that can be consumed by IGC.

Going up the pipeline, the abstractions needed to express specific ISA semantics (e.g., DPAS and nd-load, which are required for an efficient contraction implementation) are covered by the XeGPU dialect. The dialect allows for both SIMT- and SIMD-style lowering.
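
As a rough sketch of the SIMD-style form, the snippet below uses XeGPU ops for nd-load descriptors and DPAS; tile shapes and the exact textual syntax are illustrative assumptions rather than verified dialect syntax.

```
// Hypothetical XeGPU snippet: 2D block loads feeding a DPAS contraction.
// Tile shapes and exact op syntax are illustrative only.
func.func @dpas_tile(%A: memref<128x128xf16>, %B: memref<128x128xf16>,
                     %acc: vector<8x16xf32>, %i: index, %j: index, %k: index) -> vector<8x16xf32> {
  // Describe the 2D tiles to be loaded (nd-load descriptors).
  %a_desc = xegpu.create_nd_tdesc %A[%i, %k] : memref<128x128xf16> -> !xegpu.tensor_desc<8x16xf16>
  %b_desc = xegpu.create_nd_tdesc %B[%k, %j] : memref<128x128xf16> -> !xegpu.tensor_desc<16x16xf16>
  %a = xegpu.load_nd %a_desc : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
  %b = xegpu.load_nd %b_desc : !xegpu.tensor_desc<16x16xf16> -> vector<16x16xf16>
  // DPAS: tile-level multiply-accumulate on explicit vectors.
  %c = xegpu.dpas %a, %b, %acc : vector<8x16xf16>, vector<16x16xf16>, vector<8x16xf32> -> vector<8x16xf32>
  return %c : vector<8x16xf32>
}
```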

TODO: gpu(x), linalg-to-scf, gpu-map-parallel-loops.

### Integration
There are three major points of integration that affect the way the pipeline is built:
1. Input representation.
2. Memory management.
3. Runtime interfaces.

The primary input for our pipelines is linalg on tensors with named ops. These are pretty flexible (adding more named ops upstream is more-or-less straightforward) and cover a lot of ground.
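
A minimal sketch of such an input (function name and shapes are hypothetical): one named op, `linalg.matmul`, on tensor operands.

```
// Hypothetical input: a named linalg op on tensors.
func.func @matmul(%a: tensor<64x128xf32>, %b: tensor<128x256xf32>,
                  %init: tensor<64x256xf32>) -> tensor<64x256xf32> {
  // The named op carries the contraction semantics; the pipeline tiles,
  // fuses, and maps it to the GPU further down.
  %0 = linalg.matmul ins(%a, %b : tensor<64x128xf32>, tensor<128x256xf32>)
                     outs(%init : tensor<64x256xf32>) -> tensor<64x256xf32>
  return %0 : tensor<64x256xf32>
}
```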

Memory management has to deal with weight caching, dynamic shapes, input/output handling, etc. Certain decisions on the compiler-user side lead to additional complications in the pipeline.
For example, having to deal with 'logical' tensors for OneDNN imposes constraints on constant folding.

The choice of runtime interface defines how much additional logic should reside in the pipeline. For managed devices (such as a GPU) there are two distinct options:
1. The compiler only emits a binary for the target device.
2. The compiler emits a binary and a launch stub that interacts with an appropriate runtime.

The latter provides more context and, thus, potentially more opportunities for optimization. The former gives more control to the user and simplifies the pipeline.
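
A minimal sketch of the second option on the host side (kernel and function names are hypothetical): the compiler emits a `gpu.launch_func` that is later lowered to a runtime call carrying the launch configuration.

```
// Hypothetical host-side launch stub emitted under option 2.
func.func @host(%a: memref<1024xf32>, %b: memref<1024xf32>, %c: memref<1024xf32>) {
  %c1  = arith.constant 1  : index
  %c32 = arith.constant 32 : index
  // Lowered later into a runtime call that enqueues the kernel artifact.
  gpu.launch_func @kernels::@add
      blocks in (%c32, %c1, %c1)
      threads in (%c32, %c1, %c1)
      args(%a : memref<1024xf32>, %b : memref<1024xf32>, %c : memref<1024xf32>)
  return
}
```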

### The path of least resistance
The first milestone for the pipeline aims at taking what's working now and putting it together.

This includes:
- Going through the XeGPU dialect
- Using IMEX's XeGPU lowering
- Adapting TPP's linalg-to-xegpu

## Decisions

### Compilation
* Generate the code with kernel outlining. The motivation is that the compiler can take over some of the scheduling-related tasks. This implies that the interface with a framework needs to expose a synchronization mechanism (e.g., pass a GPU queue). This also affects kernel caching. JITed and non-JITed execution (the GPU module converted to serialized SPIR-V or to an actual target-specific binary) are similar cases from that point of view. Both will need to retrieve the artifact and pass it to the runtime call lowered from `gpu.launch`.
* To align with future pipelines, the target representation for the gpu module is LLVM. The actual path to the binary will be hidden inside the `gpu-module-to-binary` implementation. From the kernel lowering perspective, the target pipeline is expected to look like:

```
builtin.module(
    gpu-kernel-outlining,
    xe-attach-target{chip=xe_3 O=3},
    gpu.module(convert-gpu-to-llvm-spv),
    gpu-to-llvm,
    gpu-module-to-binary
)
```
