# Templated kernel authoring support
- More commonly and naturally: when you want to generalize some type-specific code of yours.
Let's consider the third option. So, you have your kernel function, and you start replacing a specific type (say, `int` or `float`) with a template parameter `T`. Immediately, compilation errors start piling up:
- You can't use your dynamic shared memory, defined with `extern __shared__ T shared_mem[];` - NVCC says the definitions conflict for different `T`'s.
- Some of your atomic operation calls stop compiling, because they're not available for your new choices of `T`.
- Worse, some of them still compile, but later on you realize the atomic operations get messed up by implicit conversions to types for which the atomic operation is defined.
- Some of your mathematical functions can't keep their names: `fmin()` or `__clz()` need to be replaced, for various types, with `llmin()`, `umin()`, `__clzll()` etc. What will you do? You can't use any single one of these names, nor will an `if()` do, because the compiler will likely complain about a type mismatch for at least one of the variants for each `T`.
... and so on. What a mess!
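
To make the first of these issues concrete, here is a minimal sketch (with hypothetical kernel names) of why the naive templating fails, and the manual workaround it forces on you:

```c++
// Naive templating of dynamic shared memory: instantiating this kernel for
// two different T's (say, int and float) makes NVCC reject the program,
// since each instantiation redeclares the extern array with a conflicting type.
template <typename T>
__global__ void broken_kernel(T* out)
{
    extern __shared__ T shared_mem[]; // conflicts on the second instantiation
    shared_mem[threadIdx.x] = T{};
    out[threadIdx.x] = shared_mem[threadIdx.x];
}

// The classic manual workaround: declare untyped storage, then reinterpret it.
template <typename T>
__global__ void workaround_kernel(T* out)
{
    extern __shared__ unsigned char raw_smem[];
    T* shared_mem = reinterpret_cast<T*>(raw_smem);
    shared_mem[threadIdx.x] = T{};
    out[threadIdx.x] = shared_mem[threadIdx.x];
}
```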
If, instead, you were to use these functions via cuda-kat, they would already be templated:
```c++
auto smem = kat::shared_memory::dynamic::proxy<int>();
// ...
auto first = foo();
auto second = bar();
auto thread_best = kat::minimum<double>(first, second);
kat::atomic::min(overall_best_addr, thread_best);
```
Just change the `int` and `double` into `T`'s, and you're done. The wrapper functions used would be optimized away in favor of the same builtins/intrinsics as before, when they exist, and a minimally-slower (or not slower at all) alternative when PTX doesn't allow for an appropriate single instruction.
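
For instance, the snippet above, templated over the element type, might look as follows. This is only a sketch: the kernel name is hypothetical, and `foo()` and `bar()` stand in for your own per-thread computations.

```c++
// Hypothetical per-thread computations standing in for your own code
template <typename T> __device__ T foo();
template <typename T> __device__ T bar();

template <typename T>
__global__ void find_best(T* overall_best_addr)
{
    auto smem = kat::shared_memory::dynamic::proxy<T>();
    // ... use of smem elided ...
    auto first  = foo<T>();
    auto second = bar<T>();
    auto thread_best = kat::minimum<T>(first, second);
    kat::atomic::min(overall_best_addr, thread_best);
}
```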
It should be emphasized that the cuda-kat facilities for addressing these issues...
- Do not require any specialty types or data structures;
- Do not incur any overhead (when the compiler is inlining), particularly in terms of register use; and
- Do not force you into any particular coding style (although they may influence your style).
The library's facilities include:

- Uniformly-templated wrappers for CUDA on-device builtins/intrinsics.
- Efficient implementations of functions which "feel like" builtins, but aren't actually available as such.
- CUDA-kernel-compatible versions of standard library data types and containers: `std::tuple`, `std::array`, `std::span`, with supporting freestanding functions and traits.
- Common collaboration primitives, uniformly templated, for selected or arbitrary types:
  - warp-scope (e.g. selecting a leader lane, shuffling)
  - block-scope (e.g. broadcasting some data to all of a block's threads)
  - grid-scope

  ... including collaborative implementations of standard-library algorithms, from `<algorithm>` and `<numeric>`.
- Runtime and compile-time math primitives.
- For debugging: `std::stringstream`- and `std::cout`-like classes, wrapping the use of CUDA's on-device `printf()` support.
- Uniformly-templated wrappers for atomic operations, with compare-and-swap-based fallback implementations for types not supported directly (see the sketch below).
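
To illustrate the fallback technique mentioned in the last item: when no native atomic instruction exists for a type, the operation can be emulated with a compare-and-swap loop. The following is a minimal sketch of that general technique for an atomic minimum on `float`, written with plain CUDA intrinsics; the function name is illustrative, and this is not cuda-kat's actual implementation.

```c++
// Atomic minimum on float via a compare-and-swap loop: a sketch of the
// generic fallback technique for types lacking a native atomicMin.
__device__ float atomic_min_via_cas(float* address, float value)
{
    // atomicCAS operates on integer words, so work with the float's bit pattern
    auto address_as_uint = reinterpret_cast<unsigned int*>(address);
    unsigned int observed = *address_as_uint;
    unsigned int assumed;
    do {
        assumed = observed;
        if (__uint_as_float(assumed) <= value) { break; } // already small enough
        observed = atomicCAS(address_as_uint, assumed, __float_as_uint(value));
    } while (observed != assumed); // retry if another thread raced us
    return __uint_as_float(observed); // the previously-stored value
}
```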