Opened on Oct 19, 2021
The CUDA driver and runtime APIs provide `cuMemAllocPitch` and `cudaMallocPitch` respectively, which return a pitch that is larger than the size of a row because it includes padding. Padding rows this way can improve performance through better-aligned memory loads; see https://stackoverflow.com/a/16119944/587034. Although https://forums.developer.nvidia.com/t/what-is-the-stream-ordered-equivalent-of-cudamallocpitch/189574 may suggest this is less relevant on today's hardware, it would be an interesting experiment.
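To make the padding concrete, here is a minimal host-side sketch of the address arithmetic behind a pitched allocation. The 512-byte alignment is an assumption for illustration only: the real pitch is chosen by the driver and must be taken from the allocation call. The helper names `padded_pitch` and `element_offset` are hypothetical.

```python
def padded_pitch(width_bytes, alignment=512):
    """Round a row width up to the next multiple of `alignment`.

    The alignment is an assumed value; real code must use the pitch
    returned by the allocation call instead of computing its own.
    """
    return -(-width_bytes // alignment) * alignment  # ceiling division


def element_offset(row, col, pitch, elem_size):
    """Byte offset of element (row, col): rows are `pitch` bytes apart."""
    return row * pitch + col * elem_size


# Example: 4 rows of 1000 four-byte elements each.
width, height, elem_size = 1000, 4, 4
row_bytes = width * elem_size           # bytes of payload per row
pitch = padded_pitch(row_bytes)         # bytes between consecutive row starts
padding_total = (pitch - row_bytes) * height

print(pitch, element_offset(2, 3, pitch, elem_size), padding_total)
```

Every row starts at a multiple of the pitch, so row starts stay aligned at the cost of some unused bytes at the end of each row.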
I'm not sure whether we should add a separate `CuPitchedArray` type, or whether we can generalize `CuArray` without penalizing every array access (can `CuDeviceArray` just contain the per-dimension stride instead of sizes + strides?). Either way, we should probably have a way to dispatch to the 2D/3D memcpys, if possible.
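One way to picture the strided generalization is the host-side sketch below, in which a dense array and a pitched array differ only in their stride tuple. `StridedView`, `dense` and `pitched` are hypothetical names, not existing CUDA.jl types. The sketch assumes column-major layout, zero-based indices, and a pitch that is a whole number of elements. It keeps the sizes around purely for bounds checking.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StridedView:
    """Hypothetical stand-in for a device array with per-dimension strides."""
    size: Tuple[int, ...]
    strides: Tuple[int, ...]  # in elements, one per dimension

    def linear_index(self, *idx):
        # Bounds check against the logical size, then apply the strides.
        for i, n in zip(idx, self.size):
            if not 0 <= i < n:
                raise IndexError(f"index {idx} out of bounds for size {self.size}")
        return sum(i * s for i, s in zip(idx, self.strides))


def dense(size):
    """Column-major strides with no padding: (1, n1, n1*n2, ...)."""
    strides, step = [], 1
    for n in size:
        strides.append(step)
        step *= n
    return StridedView(tuple(size), tuple(strides))


def pitched(size, pitch_elems):
    """Same layout, but the first dimension is padded to `pitch_elems`."""
    if pitch_elems < size[0]:
        raise ValueError("pitch must cover the first dimension")
    strides, step = [1], pitch_elems
    for n in size[1:]:
        strides.append(step)
        step *= n
    return StridedView(tuple(size), tuple(strides))


a = dense((1000, 4))
b = pitched((1000, 4), 1024)
print(a.strides, b.strides, a.linear_index(3, 2), b.linear_index(3, 2))
```

In this model the indexing code is identical for both layouts. What remains open is whether always multiplying by a stored stride, rather than a stride derived from the size, costs anything measurable on the device.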