FAQ
Table of contents

Questions regarding specific facilities
- Is kat::span like std::span or gsl::span?
- What's the use of ptx.cuh and on_device/ptx?
- Is the C string library optimized?

Relation to other projects/libraries
- Thrust
- libcu++
- cub
- cuda-api-wrappers

Other general questions
- You're missing a feature
- You should drop a feature
Have another question? You can write me and ask.
Questions regarding specific facilities

Is kat::span like std::span or gsl::span?

It's more like std::span. While it's true that gsl-lite's span class supports CUDA, so that I could theoretically have just used it, I've instead added an adaptation / back-port of the GNU libstdc++ std::span implementation. I am of two minds about this choice: on the one hand, it's more standards-compliant; on the other, gsl::span has some useful functionality not in the standard.

Of course, kat::span is entirely C++11, despite its functionality being similar to std::span (which was only introduced in C++20).
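As a quick illustration of what this buys you in device code, here is a minimal sketch. The header path below is an assumption (check the repository for the actual one), and the pointer-plus-length construction follows std::span's interface, which the back-port is expected to mirror.

```cpp
#include <cstddef>
#include <kat/containers/span.hpp>  // assumed header path - check the repository's actual layout

// A device-side function taking a non-owning, std::span-style view over data
// already residing in device memory (usable in C++11, unlike std::span).
__device__ float sum(kat::span<const float> values)
{
    float total = 0.0f;
    for (float v : values) { total += v; }
    return total;
}

__global__ void sum_kernel(const float* data, std::size_t length, float* result)
{
    // Constructed from a raw pointer and a length - no ownership or allocation involved.
    kat::span<const float> view{data, length};
    if (blockIdx.x == 0 && threadIdx.x == 0) { *result = sum(view); }
}
```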
If builtins:: functions translate to single PTX instructions, what's the use of ptx.cuh and on_device/ptx?

The builtins code is type-generic - templated on the parameter type(s) - whereas the code in ptx/ is (essentially) not templated and involves concrete types. Also, some functions in builtins use CUDA-supplied PTX instruction wrappers, while others use the ones we supply ourselves.
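To illustrate the division of labor (with made-up function names - these are not the library's actual identifiers), a type-generic builtin can be a thin template that resolves to concrete, per-type wrappers of a single PTX instruction:

```cpp
namespace ptx {
// Concrete, per-type wrappers: one function per instruction / operand-type combination.
__device__ inline int population_count(unsigned x)           { return __popc(x);   }
__device__ inline int population_count(unsigned long long x) { return __popcll(x); }
} // namespace ptx

namespace builtins {
// Type-generic front end: templated on the operand type, resolving at compile
// time to the appropriate concrete wrapper (or CUDA-supplied intrinsic).
template <typename I>
__device__ inline int population_count(I x) { return ptx::population_count(x); }
} // namespace builtins

__global__ void demo(const unsigned* in32, const unsigned long long* in64, int* out)
{
    // The same generic call compiles to different instructions (popc.b32 vs. popc.b64).
    out[0] = builtins::population_count(in32[0]);
    out[1] = builtins::population_count(in64[0]);
}
```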
Is the C string library optimized?

No, it is a naive, non-collaborative, per-thread implementation. With that in mind, there's nothing particularly bad about it, to my knowledge.
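By way of illustration, "naive, non-collaborative, per-thread" means roughly the following: each calling thread walks its own string sequentially, exactly as it would on a CPU. The function below is a generic sketch of that idea, not the library's actual code.

```cpp
#include <cstddef>

// Each calling thread scans its own string sequentially, just as on a CPU;
// there is no warp- or block-level cooperation and no vectorized access.
__device__ inline std::size_t naive_strlen(const char* s)
{
    const char* p = s;
    while (*p != '\0') { ++p; }
    return static_cast<std::size_t>(p - s);
}
```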
For now, coverage is not assessed by any automated tool (and it's not clear this would work well for GPU kernel code; there's also the question of which types templates need to be instantiated for, etc.)

A rough answer is that every freestanding function in the on_device directory has (AFAICT, and at the time of writing) an associated test case. Some of these may be quite thorough, others may only test typical usage; but there are no "I'm sure it works" functions. As far as container classes are concerned, I've adopted test suites from one or another of the standard C++ library implementations, but coverage may still be a bit lacking. In particular, the containers have not been thoroughly tested with a plethora of element types.

See also issue #24. I can also say that more than 100,000 individual checks are made as part of the unit tests, if you like big numbers :-P
Relation to other projects/libraries

Thrust

Thrust's basic use paradigm is calling STL-like functions (transform, copy, fill, sum/accumulate etc.) - but instead of executing on the CPU, with inputs and outputs in main system memory, they execute on the GPU, on Thrust-specific "device vectors"; see https://docs.nvidia.com/cuda/thrust/index.html#vectors.

cuda-kat differs from this paradigm in several senses:

- It's a device-side library. Whatever STL-like functions it has are for execution within kernels (__global__ functions) and __device__ functions.
- Most functionality is at a more local scope, i.e. block, warp, or thread. Of course, all threads execute the same code, but the data is usually block/warp/thread-specific, and the results are shared only with the rest of the block/warp, or not shared at all.
- cuda-kat does not require the use of its own owning data structures. Functions typically take pointers, or pointers and lengths.
- The data structures cuda-kat provides are CUDA-device-side-enabled versions of standard-library data structures (e.g. span, array, tuple).

So the aim of these two libraries is quite distinct.
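To make the contrast concrete, here is a minimal sketch (not taken from either library's documentation) of the same elementwise operation, first Thrust-style - a host-side call on a Thrust device vector - and then as plain device-side kernel code of the kind cuda-kat is meant to help you write:

```cpp
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Thrust style: a host-side call operating on a Thrust-owned device vector.
void host_side_negate(thrust::device_vector<int>& v)
{
    thrust::transform(v.begin(), v.end(), v.begin(), thrust::negate<int>());
}

// Device-side style (the setting cuda-kat targets): plain pointers and lengths,
// with each thread/warp/block working on its own portion of the data inside a kernel.
__global__ void negate_kernel(int* data, std::size_t length)
{
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < length) { data[i] = -data[i]; }
}
```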
cub

Somewhat. cub is listed as providing:
- Parallel primitives
  - Warp-wide "collective" primitives: cooperative warp-wide prefix scan, reduction, etc.
  - Block-wide "collective" primitives: cooperative I/O, sort, scan, reduction, histogram, etc.
  - Device-wide primitives (invoked from the host side)
- Utilities
  - Fancy iterators
  - Thread and thread block "I/O" (access-pattern-encapsulating objects)
  - PTX intrinsics
  - Device, kernel, and storage management
cuda-kat doesn't provide:

- Device, kernel or storage management. See cuda-api-wrappers for these (except for allocation).
- Device-wide primitives run from the host side. It does provide a few templates for device-wide actions, which correspond to a data access pattern.
- Memory access-pattern-encapsulating objects. I'm not sure I like this approach; in partial stead of it, cuda-kat offers higher-order functions like at_block_stride(), at_grid_stride() and others (see the sketch after these lists).
- Fancy iterators. Not a bad idea, actually.
cuda-kat does provide:
- PTX intrinsics - a strict superset of what you'll find in cub.
- Warp-wide "collective" primitives - some have corresponding primitives in cub, but many do not, especially ones which are more building-block-like.
- Block-wide "collective" primitives - some have corresponding primitives in cub, but many do not, especially ones which are more building-block-like.
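To illustrate the higher-order-function approach mentioned in the list above (a sketch of the general idea only - the actual names, namespaces and signatures in cuda-kat may differ), a grid-stride helper can take a callable and apply it to every index, hiding the index arithmetic from the kernel body:

```cpp
#include <cstddef>

// A generic grid-stride helper: invokes f(i) for every i in [0, length),
// with the grid's threads striding over the range collaboratively.
template <typename F>
__device__ void for_each_at_grid_stride(std::size_t length, F f)
{
    std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    std::size_t start  = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    for (std::size_t i = start; i < length; i += stride) { f(i); }
}

__global__ void scale(float* data, std::size_t length, float factor)
{
    // The access pattern lives in the helper; the kernel only states what to do per element.
    for_each_at_grid_stride(length, [=](std::size_t i) { data[i] *= factor; });
}
```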
libcu++

(Will be written soon.)
cuda-api-wrappers

cuda-api-wrappers is a header-mostly library by the same (main) author as cuda-kat. It is a library for host-side code, while cuda-kat is a library of device-side code, or code which can work on both the device side and the host side. cuda-api-wrappers is simply a different way to use the CUDA Runtime API - it does not provide additional functionality.

There is no need to use cuda-api-wrappers for writing kernels (or related host-side code) with cuda-kat, and vice versa - cuda-api-wrappers doesn't care what you wrote your kernel with and is not aware of cuda-kat facilities. However, the unit tests for cuda-kat (which are built only as an option) do require cuda-api-wrappers.
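For example (a sketch only, with a placeholder kernel body), a kernel that happens to use cuda-kat device-side facilities is launched like any other kernel - here with the plain CUDA Runtime <<<...>>> syntax, with cuda-api-wrappers nowhere in sight:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
// #include <kat/...>  // whichever cuda-kat headers the kernel body actually uses

__global__ void my_kernel(float* data, std::size_t length)
{
    // ... kernel body, possibly written using cuda-kat device-side facilities ...
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < length) { data[i] += 1.0f; }
}

void launch(float* device_data, std::size_t length)
{
    // A plain CUDA Runtime API launch: cuda-api-wrappers is not involved, and
    // nothing here depends on how the kernel body itself was written.
    unsigned threads_per_block = 256;
    unsigned num_blocks = static_cast<unsigned>((length + threads_per_block - 1) / threads_per_block);
    my_kernel<<<num_blocks, threads_per_block>>>(device_data, length);
    cudaDeviceSynchronize();
}
```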
Other general questions

You're missing a feature

That's quite possible; open an issue about it and try to convince me. Better still - fork the repository, implement that piece of functionality, write a unit test for it, and make a Pull Request (PR).
You should drop a feature

Open an issue about it and try to convince me to remove it.