
Templated kernel authoring support


Motivation

  • More commonly and naturally: when you want to generalize some type-specific code of yours.

Let's consider that last option. So, you have your kernel function, and you start replacing a specific type (say, int or float) with a template parameter T. Immediately, compilation errors start piling up:

  • You can't use your dynamic shared memory, declared with extern __shared__ T shared_mem[]; NVCC says the definitions conflict across different T's (see the sketch after this list).
  • Some of your atomic operation calls stop compiling, because they aren't available for your new choices of T.
  • Worse, some of them still compile, but you later realize the atomics misbehave: the arguments were implicitly converted to types for which the atomic operation is defined.
  • Some of your mathematical functions and intrinsics can't keep the same name, like fmin() or __clz(); they need to be renamed for various types: llmin(), umin(), __clzll() etc. What will you do? No single name works for every T, nor will an if() do, because the compiler will likely complain about a type mismatch for at least one of the variants for each T.

... and so on. What a mess!
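To make the first of these concrete, here is a minimal sketch of the shared-memory conflict; the kernel, its name and its parameters are hypothetical:

#include <cstddef>

template <typename T>
__global__ void reduce(const T* data, T* result, std::size_t n)
{
    // Fine for any single T - but instantiate this kernel for two
    // different T's, and NVCC reports "declaration is incompatible
    // with previous declaration": every instantiation refers to the
    // same extern symbol.
    extern __shared__ T shared_mem[];

    // ... the reduction itself ...
}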

A tiny taste of kat

If, instead, you were to use these functions via cuda-kat, they would already be templated:

// A typed proxy for the kernel's dynamic shared memory:
auto smem = kat::shared_memory::dynamic::proxy<int>();
// ...
auto first = foo();
auto second = bar();
// One name for "minimum", whatever the types involved:
auto thread_best = kat::minimum<double>(first, second);
// ... and one name for the atomic version, too:
kat::atomic::min(overall_best_addr, thread_best);

Just change the int and the double into T's, and you're done. The wrapper functions are optimized away in favor of the same builtins/intrinsics as before, when these exist, with a minimally-slower (or no slower at all) alternative when PTX doesn't allow for an appropriate single instruction.
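For illustration, the fully-generic form might look as follows. This is only a sketch: update_best is a hypothetical enclosing function, and foo() and bar() are the same placeholders as above, assumed to return values of (or convertible to) type T.

template <typename T>
__device__ void update_best(T* overall_best_addr)
{
    // The same typed shared-memory proxy, now parameterized over T:
    auto smem = kat::shared_memory::dynamic::proxy<T>();
    // ...
    auto first = foo();
    auto second = bar();
    // Each instantiation resolves to the appropriate intrinsic when
    // one exists, and to a fallback implementation otherwise:
    auto thread_best = kat::minimum<T>(first, second);
    kat::atomic::min(overall_best_addr, thread_best);
}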

It should be emphasized that the cuda-kat facilities for addressing these issues...

  • Do not require any specialty types or data structures;
  • Do not incur any overhead (provided the compiler inlines them), particularly in terms of register use; and
  • Do not force you into any particular coding style (although they may influence your style).

Specific facilities provided

  • Uniformly-templated wrappers for CUDA on-device builtins/intrinsics

  • Efficient implementations of functions which "feel like" builtins, but aren't actually available as such.

  • CUDA-kernel-compatible versions of standard library data types and containers: std::tuple, std::array, std::span, with supporting freestanding functions and traits.

  • Common collaboration primitives, uniformly templated, for selected or arbitrary types:

    ... including collaborative implementations of standard-library algorithms, from <algorithm> and <numeric>.

  • Runtime and compile-time math primitives.

  • For debugging: std::stringstream-like and std::cout-like classes, wrapping CUDA's on-device printf() support.

  • Uniformly-templated wrappers for atomic operations, with compare-and-swap-based fallback implementations for types not supported directly (sketched below).
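Regarding that last point: the compare-and-swap loop is a standard technique. As a sketch only (not necessarily kat's exact implementation), here is how an atomic minimum can be built for float, a type with no native atomicMin; NaN handling is ignored for brevity:

__device__ float atomic_min_fallback(float* address, float value)
{
    // atomicCAS operates on integer types, so we work on the float's
    // bit pattern, reinterpreted as an unsigned int:
    unsigned int* address_as_uint = reinterpret_cast<unsigned int*>(address);
    unsigned int observed = *address_as_uint;
    unsigned int assumed;
    do {
        assumed = observed;
        if (__uint_as_float(assumed) <= value) { break; }  // already at or below the candidate
        observed = atomicCAS(address_as_uint, assumed, __float_as_uint(value));
    } while (observed != assumed);  // retry if another thread intervened
    return __uint_as_float(observed);
}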
