# GPU Pipeline overview

This is a living document for the GPU pipeline design. Its purpose is to keep the decision history and provide a guiding overview for development. We expect swift changes in the design as we go, so this mostly highlights the guiding principles.

## Initial state description

The primary goal of the design is to meet certain qualities in the final solution.
The spirit of the design is to reuse existing parts, prefer upstream components, and target long-term support in conjunction with other devices.

At the highest level, the pipeline can be split into three main stages:
1. High-level platform-independent* transformations. These are to be shared with other flows (e.g., fusion).
2. GPU-specific transformations. These are responsible for HW mapping and include everything until a SPIR-V module is emitted.
3. Code generation. This is tailored to a particular platform and is performed by a backend.

There are existing paths for each stage (sometimes more than one, and the choice affects the other parts). A short description of the landscape follows.

### Landscape
There are two primary ways of generating GPU target binary code, both going through IGC: the scalar and the vector path.

The scalar (aka SIMT) path relies on IGC's vectorization capabilities to map logical threads to SIMD lanes. Handling synchronization (e.g., cross-lane communication) is the main burden of this otherwise transformation-amenable representation.

The vector (aka SIMD) path in IGC expects the IR to be in a certain explicitly vectorized form, primarily built via a set of intrinsics (VC-intrinsics). The main complexity of this approach for the pipeline is handling the data/compute distribution between those vectors and handling the deviation from the lowering paths of other GPU types.

Today, there are two main options to reach the low-level compiler:
1. Lower to the SPIR-V dialect and serialize it (IMEX).
2. Lower to LLVM IR and use the SPIR-V Translator (Triton).

Both produce a SPIR-V module that can be consumed by IGC.
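
For a sense of what the first option hands off, the sketch below is a minimal `spirv.module` as it would look right before serialization. The addressing/memory model, version, and capabilities are illustrative assumptions for the sketch, not a statement of what IGC requires.

```mlir
// Minimal, illustrative SPIR-V dialect module that a serializer would turn
// into a binary SPIR-V blob (capabilities shown are placeholders).
spirv.module Physical64 OpenCL requires #spirv.vce<v1.0, [Kernel, Addresses], []> {
  // An empty function is enough to show the shape of the hand-off artifact.
  spirv.func @noop() "None" {
    spirv.Return
  }
}
```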

Going up the pipeline, the abstractions needed to express specific ISA semantics (e.g., DPAS and nd-load, required for an efficient contraction implementation) are covered by the XeGPU dialect. The dialect allows for both SIMT- and SIMD-style lowering.
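
To make the abstraction level concrete, here is a rough sketch of a single contraction tile expressed with XeGPU ops. The op syntax, attributes, and legal tile shapes follow the upstream dialect documentation and may differ between MLIR versions, so treat the exact types as assumptions.

```mlir
// Rough sketch of one contraction tile at the XeGPU level (shapes and op
// syntax are illustrative; see the upstream XeGPU dialect docs for the
// exact forms supported by a given MLIR version).
func.func @dpas_tile(%A: memref<128x128xf16>, %B: memref<128x128xf16>,
                     %acc: vector<8x16xf32>) -> vector<8x16xf32> {
  %c0 = arith.constant 0 : index
  // 2D block descriptors for the tiles about to be loaded ("nd" descriptors).
  %a_desc = xegpu.create_nd_tdesc %A[%c0, %c0]
      : memref<128x128xf16> -> !xegpu.tensor_desc<8x16xf16>
  %b_desc = xegpu.create_nd_tdesc %B[%c0, %c0]
      : memref<128x128xf16> -> !xegpu.tensor_desc<16x16xf16>
  // Block loads ("nd-load") bring whole tiles into registers.
  %a = xegpu.load_nd %a_desc : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
  %b = xegpu.load_nd %b_desc : !xegpu.tensor_desc<16x16xf16> -> vector<16x16xf16>
  // DPAS performs the tile-level multiply-accumulate.
  %res = xegpu.dpas %a, %b, %acc
      : vector<8x16xf16>, vector<16x16xf16>, vector<8x16xf32> -> vector<8x16xf32>
  return %res : vector<8x16xf32>
}
```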

TODO: gpu(x), linalg-to-scf, gpu-map-parallel-loops.

### Integration
There are three major points of integration that affect the way the pipeline is built:
1. Input representation.
2. Memory management.
3. Runtime interfaces.

The primary input for our pipelines is linalg on tensors with named ops. These are pretty flexible (adding more ops upstream is more-or-less straightforward) and cover a lot of ground.
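
For illustration, a single layer of the kind the pipeline is expected to consume could look like the snippet below (function name and shapes are made up):

```mlir
// Illustrative frontend input: a named linalg op on tensors.
func.func @matmul_layer(%x: tensor<64x256xf32>, %w: tensor<256x512xf32>,
                        %init: tensor<64x512xf32>) -> tensor<64x512xf32> {
  // Named ops carry the semantics (here, a matmul) without spelling out loops,
  // which keeps the high-level, platform-independent stage simple to transform.
  %0 = linalg.matmul ins(%x, %w : tensor<64x256xf32>, tensor<256x512xf32>)
                     outs(%init : tensor<64x512xf32>) -> tensor<64x512xf32>
  return %0 : tensor<64x512xf32>
}
```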

Memory management has to deal with weight caching, dynamic shapes, input/output handling, etc. Certain decisions on the compiler-user side lead to additional complications in the pipeline.
For example, having to deal with 'logical' tensors for oneDNN imposes constraints on constant folding.

The choice of runtime interface defines how much additional logic should reside in the pipeline. For managed devices (such as a GPU) there are two distinct options:
1. The compiler only emits a binary for the target device.
2. The compiler emits a binary and a launch stub that interacts with an appropriate runtime (see the sketch below).
The latter provides more context, and thus, potentially more opportunities for optimization. The former gives more control to the user and simplifies the pipeline.
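
As a minimal sketch of what the second option means on the host side, the example below shows the stub still in its `gpu.launch_func` form, before it is lowered to actual runtime calls; the kernel, names, and sizes are made-up placeholders.

```mlir
// Hedged sketch: host-side launch stub plus the outlined kernel it refers to.
module attributes {gpu.container_module} {
  gpu.module @kernels {
    // A trivial kernel: every thread writes the same value to its element.
    gpu.func @fill(%buf: memref<1024xf32>, %value: f32) kernel {
      %tid = gpu.thread_id x
      memref.store %value, %buf[%tid] : memref<1024xf32>
      gpu.return
    }
  }
  func.func @run(%buf: memref<1024xf32>, %value: f32) {
    %c1 = arith.constant 1 : index
    %c1024 = arith.constant 1024 : index
    // The stub carries grid/block sizes and kernel arguments; lowering it
    // produces the calls into the GPU runtime (module load, queue, launch).
    gpu.launch_func @kernels::@fill
        blocks in (%c1, %c1, %c1) threads in (%c1024, %c1, %c1)
        args(%buf : memref<1024xf32>, %value : f32)
    return
  }
}
```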

### The path of least resistance
The first milestone for the pipeline creation aims at taking what's working now and putting it together.

This includes:
- Going through the XeGPU dialect
- Using IMEX's XeGPU lowering
- Adapting TPP's linalg-to-xegpu

## Decisions

### Compilation
* Generate the code with kernel outlining. The motivation is that the compiler can take over some of the scheduling-related tasks. This implies that the interface with a framework needs to expose a synchronization mechanism (e.g., pass a GPU queue). This also affects kernel caching. JITed and non-JITed execution (the GPU module converted to serialized SPIR-V or to an actual target-specific binary) are similar cases from that point of view. Both will need to retrieve the artifact and pass it to the runtime call lowered from `gpu.launch` (see the sketch after the pipeline below).
* To align with the future pipelines, the target representation for the gpu module is LLVM. The actual path to the binary will be hidden inside the `gpu-module-to-binary` implementation. From the kernel-lowering perspective, the target pipeline looks like:

```
builtin.module(
  gpu-kernel-outlining,
  xe-attach-target{chip=xe_3 O=3},
  gpu.module(convert-gpu-to-llvm-spv),
  gpu-to-llvm,
  gpu-module-to-binary
)
```
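
For context on the first pass in that list, below is a hedged sketch of the kind of host IR `gpu-kernel-outlining` consumes: an inline `gpu.launch` region that the pass moves into a `gpu.func` inside a `gpu.module`, leaving behind a `gpu.launch_func` (as in the runtime-stub sketch earlier). The kernel body and launch sizes are made up for illustration.

```mlir
// Illustrative pre-outlining host IR: the kernel body still lives inline.
func.func @scale(%buf: memref<1024xf32>, %factor: f32) {
  %c1 = arith.constant 1 : index
  %c1024 = arith.constant 1024 : index
  // One workgroup of 1024 threads; each thread scales one element in place.
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %c1024, %sy = %c1, %sz = %c1) {
    %old = memref.load %buf[%tx] : memref<1024xf32>
    %new = arith.mulf %old, %factor : f32
    memref.store %new, %buf[%tx] : memref<1024xf32>
    gpu.terminator
  }
  return
}
```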