This is a collection of examples showing how to combine offline compilation of CUDA kernels with runtime LTO linking (instead of compile-time linking).
When using `<<< >>>` to launch kernels, the entire program must be known offline, either at compile time (whole compilation) or at link time (separable compilation).
In a full NVRTC model, kernels are compiled at runtime, but from CUDA C++ code. This model is very similar to the OpenGL shader compilation pipeline, except that instead of being restricted to the GLSL language you have full access to C++.
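As a rough illustration, here is a minimal sketch of that pipeline (error handling elided; the `scale` kernel and the architecture flag are placeholders, not part of these examples):

```cpp
#include <nvrtc.h>
#include <string>

int main() {
  // Kernel source held as a runtime string, much like a GLSL shader.
  const char* source = R"(
    extern "C" __global__ void scale(float* data, float factor) {
      data[threadIdx.x] *= factor;
    })";

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "scale.cu", 0, nullptr, nullptr);

  // Code-gen only for the architecture we are actually running on.
  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 1, opts);

  // Retrieve the PTX; it can then be loaded with the CUDA driver API
  // (cuModuleLoadDataEx) and launched via cuLaunchKernel.
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);
}
```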
Some of the strengths of NVRTC are:
- You only compile the exact kernels that you launch
- You can compose with user-defined NVRTC code
- You only need to code-gen for the exact GPU you are running on
Some of the downsides of NVRTC are:
- You need to write a lot of glue code to go from code fragments to a launchable kernel
- You are compiling C++ at runtime, which can add significant compile-time overhead
- You need to ship any C++ headers your kernels depend on
Note:
Many of the downsides of NVRTC are mitigated by projects like https://github.com/NVIDIA/jitify.
The runtime LTO model was first outlined in https://developer.nvidia.com/blog/cuda-12-0-compiler-support-for-runtime-lto-using-nvjitlink-library/.
The strength of this model is that it allows for composability of iteration/access patterns and computation. We can build components as LTO-IR when performance is critical (or compile times are long) but still have other components generated at runtime, for example via NVRTC.
For example, we can compile the following CUDA kernel (two float inputs, one float output) as an LTO-IR fragment:
```cpp
// The computation is only declared here; a definition is linked in at
// runtime from another LTO-IR fragment.
extern __device__ float compute(float x, float y, float &r);

extern "C" __global__ void kernel(float* x, float* y, float* r, size_t n) {
  // Grid-stride loop over all n elements.
  auto tidx = common::grid_1d::global_thread_id();
  auto const stride = common::grid_1d::grid_stride();
  while (tidx < n) {
    compute(x[tidx], y[tidx], r[tidx]);
    tidx += stride;
  }
}
```
That would allow us to slot in different compute algorithms at runtime, such as `saxpy`, `min`, `max`, and so on. Therefore we keep a single iteration kernel while swapping in the computation it applies.
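As a rough sketch, a runtime link of two such fragments with nvJitLink could look like this (the fatbin buffers and the `sm_70` arch flag are placeholders; error handling elided):

```cpp
#include <nvJitLink.h>
#include <cstddef>
#include <vector>

// Placeholder inputs: the iteration kernel and a chosen compute
// fragment, both compiled offline into fatbins that carry LTO-IR.
extern const void* kernel_fatbin; extern size_t kernel_fatbin_size;
extern const void* saxpy_fatbin;  extern size_t saxpy_fatbin_size;

std::vector<char> link_at_runtime() {
  nvJitLinkHandle handle;
  const char* opts[] = {"-lto", "-arch=sm_70"};
  nvJitLinkCreate(&handle, 2, opts);

  // Add each fragment; the extern `compute` symbol is resolved here,
  // at runtime, and link-time optimization runs across the pieces.
  nvJitLinkAddData(handle, NVJITLINK_INPUT_FATBIN,
                   kernel_fatbin, kernel_fatbin_size, "kernel");
  nvJitLinkAddData(handle, NVJITLINK_INPUT_FATBIN,
                   saxpy_fatbin, saxpy_fatbin_size, "saxpy");
  nvJitLinkComplete(handle);

  // The linked CUBIN can be loaded with cuModuleLoadData and launched.
  size_t cubin_size = 0;
  nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
  std::vector<char> cubin(cubin_size);
  nvJitLinkGetLinkedCubin(handle, cubin.data());
  nvJitLinkDestroy(&handle);
  return cubin;
}
```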
This is the starting point for understanding runtime linking of a single offline-compiled FATBIN (with LTO-IR code).
This example uses saxpy as a way to show multiple new features:
- Link together multiple LTO-IR fatbins
- Show how to re-use the same iteration kernel for different computations (see the sketch after this list)
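For instance, the saxpy computation might be provided as a separately compiled fragment matching the extern `compute` declaration above; the scale factor is hard-coded here purely for illustration:

```cpp
// Hypothetical saxpy fragment, compiled offline to LTO-IR and linked
// against the iteration kernel at runtime. Uses a fixed a = 2.
__device__ float compute(float x, float y, float &r) {
  r = 2.0f * x + y;
  return r;
}
```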
All of the algorithms in the saxpy example are memory-bandwidth bound, so that example says little about compute-bound work. This example shows how viable LTO-IR fragments are for compute-bound algorithms, using the Mandelbrot set as a proxy:
- Shows how to compose LTO-IR fatbins for compute-bound problems (a hypothetical fragment is sketched below)
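A hypothetical compute-bound fragment of the same shape, treating (x, y) as a point in the complex plane and writing the escape iteration count through r:

```cpp
// Hypothetical Mandelbrot fragment: many FLOPs per memory access, so
// the linked kernel becomes compute bound rather than bandwidth bound.
__device__ float compute(float x, float y, float &r) {
  float zx = 0.0f, zy = 0.0f;
  int i = 0;
  while (zx * zx + zy * zy < 4.0f && i < 256) {
    float const tmp = zx * zx - zy * zy + x;
    zy = 2.0f * zx * zy + y;
    zx = tmp;
    ++i;
  }
  r = static_cast<float>(i);
  return r;
}
```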
The advanced example takes the saxpy example, generalizes the logic a step further, and adds two major new features:
- Shows how to mix fatbins and NVRTC fragments (sketched below)
- Shows how to launch kernels based on the input data type (float vs. double)
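The mixing piece can be sketched as follows: NVRTC (since CUDA 12) can emit LTO-IR directly via the `-dlto` option, and that output can be fed to the same nvJitLink step as the offline fatbins. The function name and architecture flag below are illustrative only:

```cpp
#include <nvrtc.h>
#include <string>

// Compile a runtime-supplied `compute` definition to LTO-IR so it can
// be added to the link as NVJITLINK_INPUT_LTOIR alongside the
// offline-compiled iteration-kernel fatbin. Error handling elided.
std::string compile_fragment_to_ltoir(const char* source) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "compute.cu", 0, nullptr, nullptr);

  const char* opts[] = {"-dlto", "--relocatable-device-code=true",
                        "--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 3, opts);

  size_t ltoir_size = 0;
  nvrtcGetLTOIRSize(prog, &ltoir_size);
  std::string ltoir(ltoir_size, '\0');
  nvrtcGetLTOIR(prog, ltoir.data());
  nvrtcDestroyProgram(&prog);
  return ltoir;
}
```

Dispatching on the input data type then amounts to selecting which fragment source (a float or double variant) gets compiled and linked.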