lamarrr/runtime_lto_examples

Intro

This is a collection of examples showing how to combine offline compilation of CUDA kernels with runtime LTO linking (instead of compile-time linking).

Offline model of <<< >>> kernel launches

When using <<< >>> to launch kernels, the entire program must be known offline, either at compile time (whole-program compilation) or at link time (separable compilation).
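
As a minimal sketch of this constraint (the kernel and launcher below are illustrative, not taken from this repository), the symbol named in a <<< >>> launch must be resolvable when the program is built:

#include <cstddef>

// The kernel symbol is fixed at build time; it cannot be swapped at runtime.
__global__ void saxpy_offline(float a, float const* x, float* y, size_t n) {
  size_t const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

void launch(float a, float const* x, float* y, size_t n) {
  if (n == 0) return;
  auto const blocks = static_cast<unsigned int>((n + 255) / 256);
  saxpy_offline<<<blocks, 256>>>(a, x, y, n);
}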

NVRTC model

In a full NVRTC model, kernels are compiled at runtime, but from CUDA C++ source. This model is very similar to the OpenGL shader-compilation pipeline, except that instead of being restricted to GLSL you have full access to C++.
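
A minimal sketch of that flow, with error checking omitted; the kernel string, names, and architecture option are illustrative, not from this repository:

#include <nvrtc.h>
#include <vector>

// A self-contained kernel as a runtime string.
const char* source = R"(
extern "C" __global__ void scale(float* x, float a, unsigned int n) {
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
})";

// Runtime-compile the string to PTX with NVRTC. The resulting PTX can then
// be loaded with the driver API (cuModuleLoadData, cuModuleGetFunction,
// cuLaunchKernel).
std::vector<char> compile_to_ptx() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "scale.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_80"};  // pick your arch
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);
  return ptx;
}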

Some of the strengths of NVRTC are:

  • You only compile the exact kernels that you launch
  • You can compose with user defined nvrtc code
  • You only need to code-gen for the exact GPU you are on

Some of the downsides of NVRTC are:

  • You need to write lots of glue code to go from code fragments to a kernel
  • You are compiling C++ files at runtime, which can incur significant compile-time overhead
  • You need to ship any C++ headers your kernels depend on

Note:

Many of the downsides of NVRTC are mitigated by projects like https://github.com/NVIDIA/jitify

Runtime LTO model

The runtime LTO model was first outlined in https://developer.nvidia.com/blog/cuda-12-0-compiler-support-for-runtime-lto-using-nvjitlink-library/.

The strength of this model is that it allows for composability of iteration/access and computation. We can build components as LTO-IR when performance is critical (or compile times are long) but still have other components generated at runtime by NVRTC.

For example, we can compile the following CUDA kernel (two float inputs, one float output) as an LTO-IR fragment:

// `compute` is provided by a separate LTO-IR fragment and resolved at link time.
extern __device__ float compute(float x, float y, float &r);

extern "C" __global__ void kernel(float* x, float* y, float* r, size_t n) {
  // Grid-stride loop; the `common::grid_1d` helpers come from this repository.
  auto tidx = common::grid_1d::global_thread_id();
  auto const stride = common::grid_1d::grid_stride();
  while (tidx < n) {
    compute(x[tidx], y[tidx], r[tidx]);
    tidx += stride;
  }
}

That would allow us to slot in different compute algorithms at runtime, such as saxpy, min, max, and so on. Therefore we keep our iteration kernel fixed and reusable while the computation stays swappable.
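
For instance, a saxpy computation could be supplied as its own fragment. This is a sketch, not taken from the repository, and the nvcc flags in the comment are an assumption:

// Compiled offline to an LTO-IR fatbin, e.g. with:
//   nvcc -dc -dlto -arch=sm_80 -fatbin compute.cu -o compute.fatbin
__device__ float compute(float x, float y, float &r) {
  r = 2.0f * x + y;  // saxpy with a = 2.0f
  return r;
}

At runtime, the iteration kernel and the chosen compute fragment can then be linked together with nvJitLink, along the lines of the CUDA 12 blog post linked above. A minimal sketch follows; error checking is omitted and the file names are placeholders:

#include <nvJitLink.h>
#include <vector>

std::vector<char> link_lto_fragments() {
  nvJitLinkHandle handle;
  const char* opts[] = {"-lto", "-arch=sm_80"};  // must match the offline arch
  nvJitLinkCreate(&handle, 2, opts);

  // Both inputs carry LTO-IR produced by offline compilation.
  nvJitLinkAddFile(handle, NVJITLINK_INPUT_FATBIN, "kernel.fatbin");
  nvJitLinkAddFile(handle, NVJITLINK_INPUT_FATBIN, "compute.fatbin");

  nvJitLinkComplete(handle);  // runs the LTO step and produces a cubin

  size_t cubin_size = 0;
  nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
  std::vector<char> cubin(cubin_size);
  nvJitLinkGetLinkedCubin(handle, cubin.data());
  nvJitLinkDestroy(&handle);
  return cubin;  // load with cuModuleLoadData and launch via cuLaunchKernel
}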

Examples

This is the starting point for understanding runtime linking of a single offline-compiled FATBIN (containing LTO-IR code).

This example uses saxpy as a way to show multiple new features:

  • Link together multiple LTO-IR fatbins
  • Show how to reuse the same iteration kernel for different computations

All of the algorithms in the saxpy example are memory-bandwidth bound, so a separate example shows how viable LTO-IR fragments are for compute-bound algorithms. For this, our proxy is the Mandelbrot set:

  • Shows how to compose LTO-IR fatbins for compute-bound problems

The advanced example takes the saxpy example, generalizes the logic a step further, and adds two major new features (see the sketch after this list):

  • Shows how to mix fatbins and nvrtc fragments
  • Shows how to launch kernels based on the input datatype (float vs. double)
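
A rough sketch of how such mixing can work, assuming NVRTC is asked to emit LTO-IR (via the -dlto option and nvrtcGetLTOIR, available since CUDA 12) that is then fed into the same nvJitLink pipeline; the names, options, and file paths here are illustrative, not from this repository:

#include <nvJitLink.h>
#include <nvrtc.h>
#include <vector>

// Compile a user-supplied compute fragment to LTO-IR at runtime, then link
// it against an offline-compiled iteration kernel. Error checking omitted.
std::vector<char> mix_nvrtc_and_fatbin(const char* compute_src) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, compute_src, "compute.cu", 0, nullptr, nullptr);
  const char* copts[] = {"--gpu-architecture=compute_80", "-dlto",
                         "--relocatable-device-code=true"};
  nvrtcCompileProgram(prog, 3, copts);

  size_t ltoir_size = 0;
  nvrtcGetLTOIRSize(prog, &ltoir_size);
  std::vector<char> ltoir(ltoir_size);
  nvrtcGetLTOIR(prog, ltoir.data());
  nvrtcDestroyProgram(&prog);

  // Link the runtime-generated LTO-IR with an offline-compiled fatbin.
  nvJitLinkHandle handle;
  const char* lopts[] = {"-lto", "-arch=sm_80"};
  nvJitLinkCreate(&handle, 2, lopts);
  nvJitLinkAddFile(handle, NVJITLINK_INPUT_FATBIN, "kernel.fatbin");
  nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR, ltoir.data(), ltoir_size,
                   "compute");
  nvJitLinkComplete(handle);

  size_t cubin_size = 0;
  nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
  std::vector<char> cubin(cubin_size);
  nvJitLinkGetLinkedCubin(handle, cubin.data());
  nvJitLinkDestroy(&handle);
  return cubin;
}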
