FAQ
Table of contents

Questions regarding specific facilities
- Is kat::span like std::span or gsl::span?
- What's the use of ptx.cuh and on_device/ptx?
- Is the C string library optimized?

Relation to other projects/libraries
- Thrust
- libcu++
- cub
- cuda-api-wrappers

Other general questions
- You're missing a feature
- You should drop a feature
Have another question? You can write me and ask.
Questions regarding specific facilities

Is kat::span like std::span or gsl::span?

It's more like std::span. While it's true that gsl-lite's span class supports CUDA, so that I could theoretically have just used it, I've instead added an adaptation / back-port of the GNU libstdc++ std::span implementation. I am of two minds about this choice: on the one hand, it's more standards-compliant; on the other, gsl::span has some useful functionality not in the standard.

Of course, kat::span is entirely C++11, despite its functionality being similar to std::span (which was only introduced in C++20).
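As a quick illustration of what this buys you in device code, here is a minimal sketch. The header path below is an assumption (check the repository for the actual one), and the pointer-plus-length construction follows std::span's interface, which the back-port is expected to mirror.

```cpp
#include <cstddef>
#include <kat/containers/span.hpp>  // assumed header path - check the repository's actual layout

// A device-side function taking a non-owning, std::span-style view over data
// already residing in device memory (usable in C++11, unlike std::span).
__device__ float sum(kat::span<const float> values)
{
    float total = 0.0f;
    for (float v : values) { total += v; }
    return total;
}

__global__ void sum_kernel(const float* data, std::size_t length, float* result)
{
    // Constructed from a raw pointer and a length - no ownership or allocation involved.
    kat::span<const float> view{data, length};
    if (blockIdx.x == 0 && threadIdx.x == 0) { *result = sum(view); }
}
```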
If builtins:: functions translate to single PTX instructions, what's the use of ptx.cuh and on_device/ptx?

The builtins code is type-generic - templated on the parameter type(s) - whereas the code in ptx/ is (essentially) not templated and involves concrete types. Also, some functions in builtins use CUDA-supplied PTX instruction wrappers, while others use the ones we supply ourselves.
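To illustrate the division of labor (with made-up function names - these are not the library's actual identifiers), a type-generic builtin can be a thin template that resolves to concrete, per-type wrappers of a single PTX instruction:

```cpp
namespace ptx {
// Concrete, per-type wrappers: one function per instruction / operand-type combination.
__device__ inline int population_count(unsigned x)           { return __popc(x);   }
__device__ inline int population_count(unsigned long long x) { return __popcll(x); }
} // namespace ptx

namespace builtins {
// Type-generic front end: templated on the operand type, resolving at compile
// time to the appropriate concrete wrapper (or CUDA-supplied intrinsic).
template <typename I>
__device__ inline int population_count(I x) { return ptx::population_count(x); }
} // namespace builtins

__global__ void demo(const unsigned* in32, const unsigned long long* in64, int* out)
{
    // The same generic call compiles to different instructions (popc.b32 vs. popc.b64).
    out[0] = builtins::population_count(in32[0]);
    out[1] = builtins::population_count(in64[0]);
}
```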
Is the C string library optimized?

No, it is a naive, non-collaborative, per-thread implementation. With that in mind, there's nothing particularly bad about it, to my knowledge.
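By way of illustration, "naive, non-collaborative, per-thread" means roughly the following: each calling thread walks its own string sequentially, exactly as it would on a CPU. The function below is a generic sketch of that idea, not the library's actual code.

```cpp
#include <cstddef>

// Each calling thread scans its own string sequentially, just as on a CPU;
// there is no warp- or block-level cooperation and no vectorized access.
__device__ inline std::size_t naive_strlen(const char* s)
{
    const char* p = s;
    while (*p != '\0') { ++p; }
    return static_cast<std::size_t>(p - s);
}
```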
For now, coverage is not assessed by any automated tool (and it's not clear this would work well for GPU kernel code; there's also the question of which types templates need to be instantiated for, etc.)

A rough answer is that every freestanding function in the on_device directory has (AFAICT, and at the time of writing) an associated test case. Some of these may be quite thorough, others may only test typical usage; but there are no "I'm sure it works" functions. As far as container classes are concerned, I've adopted test suites from one or another of the standard C++ library implementations, but coverage may still be a bit lacking. In particular, the containers have not been thoroughly tested with a plethora of element types.

See also issue #24. I can also say that more than 100,000 individual checks are made as part of the unit tests, if you like big numbers :-P
Relation to other projects/libraries

Thrust

Thrust's basic use paradigm is calling STL-like functions (transform, copy, fill, sum/accumulate etc.) - but instead of executing on the CPU, with inputs and outputs in main system memory, they execute on the GPU, on Thrust-specific "device vectors"; see https://docs.nvidia.com/cuda/thrust/index.html#vectors.

cuda-kat differs from this paradigm in several senses:

- It's a device-side library. Whatever STL-like functions it has are for execution within kernels (__global__ functions) and __device__ functions.
- Most functionality is at a more local scope, i.e. block, warp, or thread. Of course, all threads execute the same code, but the data is usually block/warp/thread-specific, and the results are shared only with the rest of the block/warp, or not shared at all.
- cuda-kat does not require the use of its own owning data structures. Functions typically take pointers, or pointers and lengths.
- The data structures cuda-kat provides are CUDA-device-side-enabled versions of standard-library data structures (e.g. span, array, tuple).

So the aim of these two libraries is quite distinct.
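To make the contrast concrete, here is a minimal sketch (not taken from either library's documentation) of the same elementwise operation, first Thrust-style - a host-side call on a Thrust device vector - and then as plain device-side kernel code of the kind cuda-kat is meant to help you write:

```cpp
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Thrust style: a host-side call operating on a Thrust-owned device vector.
void host_side_negate(thrust::device_vector<int>& v)
{
    thrust::transform(v.begin(), v.end(), v.begin(), thrust::negate<int>());
}

// Device-side style (the setting cuda-kat targets): plain pointers and lengths,
// with each thread/warp/block working on its own portion of the data inside a kernel.
__global__ void negate_kernel(int* data, std::size_t length)
{
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < length) { data[i] = -data[i]; }
}
```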
cub

Somewhat. cub is listed as providing:
- Parallel primitives
  - Warp-wide "collective" primitives: cooperative warp-wide prefix scan, reduction, etc.
  - Block-wide "collective" primitives: cooperative I/O, sort, scan, reduction, histogram, etc.
  - Device-wide primitives (invoked from the host side)
- Utilities
  - Fancy iterators
  - Thread and thread block "I/O" (access-pattern-encapsulating objects)
  - PTX intrinsics
  - Device, kernel, and storage management
cuda-kat doesn't provide:

- Device, kernel or storage management. See cuda-api-wrappers for these (except for allocation).
- Device-wide primitives run from the host side. It does provide a few templates for device-wide actions, which correspond to a data access pattern.
- Memory access-pattern-encapsulating objects. I'm not sure I like this approach; in partial stead of it, cuda-kat offers higher-order functions like at_block_stride(), at_grid_stride() and others (see the sketch after these lists).
- Fancy iterators. Not a bad idea, actually.
cuda-kat does provide:
- PTX intrinsics - a strict superset of what you'll find in cub.
- Warp-wide "collective" primitives - some have corresponding primitives in cub, but many do not, especially ones which are more building-block-like.
- Block-wide "collective" primitives - some have corresponding primitives in cub, but many do not, especially ones which are more building-block-like.
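To illustrate the higher-order-function approach mentioned in the list above (a sketch of the general idea only - the actual names, namespaces and signatures in cuda-kat may differ), a grid-stride helper can take a callable and apply it to every index, hiding the index arithmetic from the kernel body:

```cpp
#include <cstddef>

// A generic grid-stride helper: invokes f(i) for every i in [0, length),
// with the grid's threads striding over the range collaboratively.
template <typename F>
__device__ void for_each_at_grid_stride(std::size_t length, F f)
{
    std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    std::size_t start  = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    for (std::size_t i = start; i < length; i += stride) { f(i); }
}

__global__ void scale(float* data, std::size_t length, float factor)
{
    // The access pattern lives in the helper; the kernel only states what to do per element.
    for_each_at_grid_stride(length, [=](std::size_t i) { data[i] *= factor; });
}
```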
libcu++

(Will be written soon.)
cuda-api-wrappers

cuda-api-wrappers is a header-mostly library by the same (main) author as cuda-kat. It is a library for host-side code, while cuda-kat is a library of device-side code, or code which can work on both the device side and the host side. cuda-api-wrappers is simply a different way to use the CUDA Runtime API - it does not provide additional functionality.

There is no need to use cuda-api-wrappers for writing kernels (or related host-side code) with cuda-kat, and vice versa - cuda-api-wrappers doesn't care what you wrote your kernel with and is not aware of cuda-kat facilities. However, the unit tests for cuda-kat (which are built only as an option) do require cuda-api-wrappers.
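For example (a sketch only, with a placeholder kernel body), a kernel that happens to use cuda-kat device-side facilities is launched like any other kernel - here with the plain CUDA Runtime <<<...>>> syntax, with cuda-api-wrappers nowhere in sight:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
// #include <kat/...>  // whichever cuda-kat headers the kernel body actually uses

__global__ void my_kernel(float* data, std::size_t length)
{
    // ... kernel body, possibly written using cuda-kat device-side facilities ...
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < length) { data[i] += 1.0f; }
}

void launch(float* device_data, std::size_t length)
{
    // A plain CUDA Runtime API launch: cuda-api-wrappers is not involved, and
    // nothing here depends on how the kernel body itself was written.
    unsigned threads_per_block = 256;
    unsigned num_blocks = static_cast<unsigned>((length + threads_per_block - 1) / threads_per_block);
    my_kernel<<<num_blocks, threads_per_block>>>(device_data, length);
    cudaDeviceSynchronize();
}
```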
Other general questions

You're missing a feature

That's quite possible; open an issue about it and try to convince me. Better still - fork the repository, implement that piece of functionality, write a unit test for it, and make a Pull Request (PR).
You should drop a feature

Open an issue about it and try to convince me to remove it.