This is a collection of examples showing how to combine offline compilation of CUDA kernels with runtime LTO linking (instead of compile-time linking).
When using `<<< >>>` to launch kernels, the entire program must be known offline, either at compile time (whole compilation) or at link time (separable compilation).
In a full NVRTC model, kernels are compiled at runtime, but from CUDA C++ code. This model is very similar to the OpenGL shader compilation pipeline, except that instead of being restricted to the GLSL language you have full access to C++.
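As a rough illustration, here is a minimal sketch of that pipeline (error handling elided; the `scale` kernel and the architecture flag are placeholders, not part of these examples):

```cpp
#include <nvrtc.h>
#include <string>

int main() {
  // Kernel source held as a runtime string, much like a GLSL shader.
  const char* source = R"(
    extern "C" __global__ void scale(float* data, float factor) {
      data[threadIdx.x] *= factor;
    })";

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "scale.cu", 0, nullptr, nullptr);

  // Code-gen only for the architecture we are actually running on.
  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 1, opts);

  // Retrieve the PTX; it can then be loaded with the CUDA driver API
  // (cuModuleLoadDataEx) and launched via cuLaunchKernel.
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);
}
```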
Some of the strengths of NVRTC are:
- You only compile the exact kernels that you launch
- You can compose with user-defined NVRTC code
- You only need to code-gen for the exact GPU you are running on
Some of the downsides of NVRTC are:
- You need to write a lot of glue code to go from code fragments to a launchable kernel
- You are compiling C++ at runtime, which can add significant compile-time overhead
- You need to ship any C++ headers your kernels depend on
Note:
Many of the downsides of NVRTC are mitigated by projects like https://github.com/NVIDIA/jitify.
The runtime LTO model was first outlined in https://developer.nvidia.com/blog/cuda-12-0-compiler-support-for-runtime-lto-using-nvjitlink-library/.
The strength of this model is that it allows for composability of iteration/access patterns and computation. We can build components as LTO-IR when performance is critical (or compile times are long) but still have other components generated at runtime, for example via NVRTC.
For example, we can compile the following CUDA kernel (two float inputs, one float output) as an LTO-IR fragment:
```cpp
// The computation is only declared here; a definition is linked in at
// runtime from another LTO-IR fragment.
extern __device__ float compute(float x, float y, float &r);

extern "C" __global__ void kernel(float* x, float* y, float* r, size_t n) {
  // Grid-stride loop over all n elements.
  auto tidx = common::grid_1d::global_thread_id();
  auto const stride = common::grid_1d::grid_stride();
  while (tidx < n) {
    compute(x[tidx], y[tidx], r[tidx]);
    tidx += stride;
  }
}
```
That would allow us to slot in different compute algorithms at runtime, such as `saxpy`, `min`, `max`, and so on. Therefore we keep a single iteration kernel while swapping in the computation it applies.
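As a rough sketch, a runtime link of two such fragments with nvJitLink could look like this (the fatbin buffers and the `sm_70` arch flag are placeholders; error handling elided):

```cpp
#include <nvJitLink.h>
#include <cstddef>
#include <vector>

// Placeholder inputs: the iteration kernel and a chosen compute
// fragment, both compiled offline into fatbins that carry LTO-IR.
extern const void* kernel_fatbin; extern size_t kernel_fatbin_size;
extern const void* saxpy_fatbin;  extern size_t saxpy_fatbin_size;

std::vector<char> link_at_runtime() {
  nvJitLinkHandle handle;
  const char* opts[] = {"-lto", "-arch=sm_70"};
  nvJitLinkCreate(&handle, 2, opts);

  // Add each fragment; the extern `compute` symbol is resolved here,
  // at runtime, and link-time optimization runs across the pieces.
  nvJitLinkAddData(handle, NVJITLINK_INPUT_FATBIN,
                   kernel_fatbin, kernel_fatbin_size, "kernel");
  nvJitLinkAddData(handle, NVJITLINK_INPUT_FATBIN,
                   saxpy_fatbin, saxpy_fatbin_size, "saxpy");
  nvJitLinkComplete(handle);

  // The linked CUBIN can be loaded with cuModuleLoadData and launched.
  size_t cubin_size = 0;
  nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
  std::vector<char> cubin(cubin_size);
  nvJitLinkGetLinkedCubin(handle, cubin.data());
  nvJitLinkDestroy(&handle);
  return cubin;
}
```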
This is the starting point for understanding runtime linking of a single offline-compiled FATBIN (with LTO-IR code).
This example uses saxpy as a way to show multiple new features:
- Link together multiple LTO-IR fatbins
- Show how to re-use the same iteration kernel for different computations (see the sketch after this list)
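For instance, the saxpy computation might be provided as a separately compiled fragment matching the extern `compute` declaration above; the scale factor is hard-coded here purely for illustration:

```cpp
// Hypothetical saxpy fragment, compiled offline to LTO-IR and linked
// against the iteration kernel at runtime. Uses a fixed a = 2.
__device__ float compute(float x, float y, float &r) {
  r = 2.0f * x + y;
  return r;
}
```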
All of the algorithms in the saxpy example are memory-bandwidth bound, so that example says little about compute-bound work. This example shows how viable LTO-IR fragments are for compute-bound algorithms, using the Mandelbrot set as a proxy:
- Shows how to compose LTO-IR fatbins for compute-bound problems (a hypothetical fragment is sketched below)
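A hypothetical compute-bound fragment of the same shape, treating (x, y) as a point in the complex plane and writing the escape iteration count through r:

```cpp
// Hypothetical Mandelbrot fragment: many FLOPs per memory access, so
// the linked kernel becomes compute bound rather than bandwidth bound.
__device__ float compute(float x, float y, float &r) {
  float zx = 0.0f, zy = 0.0f;
  int i = 0;
  while (zx * zx + zy * zy < 4.0f && i < 256) {
    float const tmp = zx * zx - zy * zy + x;
    zy = 2.0f * zx * zy + y;
    zx = tmp;
    ++i;
  }
  r = static_cast<float>(i);
  return r;
}
```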
The advanced example takes the saxpy example, generalizes the logic a step further, and adds two major new features:
- Shows how to mix fatbins and NVRTC fragments (sketched below)
- Shows how to launch kernels based on the input data type (float vs. double)
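The mixing piece can be sketched as follows: NVRTC (since CUDA 12) can emit LTO-IR directly via the `-dlto` option, and that output can be fed to the same nvJitLink step as the offline fatbins. The function name and architecture flag below are illustrative only:

```cpp
#include <nvrtc.h>
#include <string>

// Compile a runtime-supplied `compute` definition to LTO-IR so it can
// be added to the link as NVJITLINK_INPUT_LTOIR alongside the
// offline-compiled iteration-kernel fatbin. Error handling elided.
std::string compile_fragment_to_ltoir(const char* source) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "compute.cu", 0, nullptr, nullptr);

  const char* opts[] = {"-dlto", "--relocatable-device-code=true",
                        "--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 3, opts);

  size_t ltoir_size = 0;
  nvrtcGetLTOIRSize(prog, &ltoir_size);
  std::string ltoir(ltoir_size, '\0');
  nvrtcGetLTOIR(prog, ltoir.data());
  nvrtcDestroyProgram(&prog);
  return ltoir;
}
```

Dispatching on the input data type then amounts to selecting which fragment source (a float or double variant) gets compiled and linked.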