Description
Study the Memory Management section of the numba.cuda documentation and document, with examples, the equivalent features in numba-dppy. Identify missing features, e.g. device arrays.
- Data transfer
CUDA | DPPY |
---|---|
numba.cuda.device_array | - |
numba.cuda.device_array_like | - |
numba.cuda.to_device | - |
numba.cuda.as_cuda_array (create a DeviceNDArray from any object that implements the CUDA Array Interface) | - |
numba.cuda.is_cuda_array | - |
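A minimal sketch of the contrast, assuming the 0.x-era numba-dppy/dpctl stack (the numba_dppy module name, dpctl.device_context, and the "opencl:gpu" selector are assumptions about the installed environment): CUDA transfers data explicitly through device arrays, while numba-dppy has no device-array API and transfers NumPy arrays implicitly at kernel launch.

```python
import numpy as np
import dpctl
import numba_dppy as dppy
from numba import cuda

# CUDA: explicit host<->device transfer via device arrays
a = np.arange(1024, dtype=np.float32)
d_a = cuda.to_device(a)        # copy host -> device, returns a DeviceNDArray
h_a = d_a.copy_to_host()       # copy device -> host

# DPPY: no device-array API; arrays move implicitly at kernel launch
@dppy.kernel
def twice(x):
    i = dppy.get_global_id(0)
    x[i] *= 2

with dpctl.device_context("opencl:gpu"):
    twice[1024, dppy.DEFAULT_LOCAL_SIZE](a)  # a is copied in and written back
```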
- Device arrays in CUDA
CUDA | DPPY |
---|---|
numba.cuda.cudadrv.devicearray.DeviceNDArray | - |
copy_to_host | - |
is_c_contiguous | - |
is_f_contiguous | - |
ravel | - |
reshape | - |
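For reference, a short sketch of the DeviceNDArray methods listed above on the CUDA side; numba-dppy currently offers no device-array counterpart:

```python
import numpy as np
from numba import cuda

d = cuda.to_device(np.zeros((4, 4), dtype=np.float32))  # a DeviceNDArray
print(d.is_c_contiguous())   # True
print(d.is_f_contiguous())   # False
flat = d.ravel()             # flattened view, stays on the device
r = d.reshape(2, 8)          # reshaped without a round trip to the host
host = r.copy_to_host()      # explicit device -> host copy
```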
- Pinned memory in CUDA / no explicit mechanism to request pinned memory in SYCL
There are generally two ways in which host memory can be allocated:
  * When the cl::sycl::property::buffer::use_host_ptr property is not used, the SYCL runtime allocates host memory when required. This uses an implementation-specific mechanism, which may attempt to request pinned memory.
  * When the cl::sycl::property::buffer::use_host_ptr property is used, the SYCL runtime does not allocate host memory and instead uses the pointer provided when the buffer is constructed. In this case it is the user's responsibility to ensure that any requirements on the allocation for it to be treated as pinned memory are satisfied.

Users can also manually allocate pinned memory on the host and hand it over to the SYCL implementation. This often involves allocating host memory with a suitable alignment and size multiple, and can sometimes be managed manually using OS-specific operations such as mmap and munmap.
CUDA | DPPY |
---|---|
numba.cuda.pinned | - |
numba.cuda.pinned_array | - |
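A sketch of the two CUDA pinned-memory APIs from the table; as described above, SYCL (and therefore numba-dppy) exposes no explicit equivalent:

```python
import numpy as np
from numba import cuda

# Allocate a new page-locked (pinned) host array
buf = cuda.pinned_array(1024, dtype=np.float32)

# Temporarily pin an existing NumPy array for the duration of a transfer
a = np.arange(1024, dtype=np.float32)
stream = cuda.stream()
with cuda.pinned(a):
    d_a = cuda.to_device(a, stream=stream)  # transfer from pinned memory may be asynchronous
stream.synchronize()
```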
- Streams in CUDA / Queue in SYCL
In a similar fashion to CUDA streams, SYCL queues submit command groups for execution asynchronously. However, SYCL is a higher-level programming model, and data transfer operations are implicitly deduced from the dependencies of the kernels submitted to any queue. Furthermore, SYCL queues can map to multiple OpenCL queues, enabling transparent overlapping of data-transfer and kernel execution. The SYCL runtime handles the execution order of the different command groups (kernel + dependencies) automatically across multiple queues in different devices.
CUDA | DPPY |
---|---|
numba.cuda.stream | SYCL queue (via dpctl.device_context) |
numba.cuda.default_stream | - |
numba.cuda.legacy_default_stream | - |
numba.cuda.per_thread_default_stream | - |
numba.cuda.external_stream | - |
numba.cuda.cudadrv.driver.Stream | - |
auto_synchronize | - |
synchronize | event |
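A sketch contrasting the two models, assuming the 0.x-era numba-dppy/dpctl API: on the CUDA side an explicit stream with auto_synchronize, on the DPPY side a device context whose SYCL queue receives the submitted kernels.

```python
import numpy as np
import dpctl
import numba_dppy as dppy
from numba import cuda

a = np.ones(1024, dtype=np.float32)

# CUDA: an explicit stream; auto_synchronize waits on exit from the block
stream = cuda.stream()
with stream.auto_synchronize():
    d_a = cuda.to_device(a, stream=stream)

# DPPY: kernels launched inside a device_context are submitted to that
# context's SYCL queue; the runtime orders the command groups
@dppy.kernel
def incr(x):
    i = dppy.get_global_id(0)
    x[i] += 1

with dpctl.device_context("opencl:gpu"):
    incr[1024, dppy.DEFAULT_LOCAL_SIZE](a)
```

Note that DPPY exposes no per-stream control from Python: ordering and synchronization are handled by the SYCL runtime, which is why most rows in the table have no mapping.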
- Per-block Shared memory and thread synchronization in CUDA / Local memory in SYCL
CUDA | DPPY |
---|---|
numba.cuda.shared.array | dppy.local.static_alloc |
numba.cuda.syncthreads | dppy.barrier |
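A sketch of this mapping using the identifiers from the table (the kernel body, work-group size, and device selector are illustrative):

```python
import numpy as np
import dpctl
import numba_dppy as dppy
from numba import float32

@dppy.kernel
def reverse(A):
    # work-group local memory, like numba.cuda.shared.array
    lm = dppy.local.static_alloc(64, float32)
    i = dppy.get_local_id(0)
    lm[i] = A[i]
    dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE)  # like numba.cuda.syncthreads
    A[i] = lm[63 - i]

a = np.arange(64, dtype=np.float32)
with dpctl.device_context("opencl:gpu"):
    reverse[64, 64](a)
```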
- Per-thread Local memory / Private memory in SYCL
CUDA | DPPY |
---|---|
numba.cuda.local.array | - |
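A sketch of the CUDA side; in SYCL the analogue is private memory, which is simply a variable declared inside the kernel, so no dedicated numba-dppy API exists yet:

```python
from numba import cuda, float32

@cuda.jit
def poly(xs, out):
    i = cuda.grid(1)
    if i < xs.size:
        # private, per-thread scratch array
        c = cuda.local.array(3, float32)
        c[0], c[1], c[2] = 1.0, 2.0, 3.0
        out[i] = c[0] + c[1] * xs[i] + c[2] * xs[i] * xs[i]
```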
- Constant memory in CUDA / Constant memory in SYCL
CUDA | DPPY |
---|---|
numba.cuda.const.array_like | - |
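A sketch of numba.cuda.const.array_like, which copies a host array into constant memory at kernel compile time; the table shows no numba-dppy equivalent:

```python
import numpy as np
from numba import cuda

WEIGHTS = np.array([0.25, 0.5, 0.25], dtype=np.float32)

@cuda.jit
def smooth(x, out):
    w = cuda.const.array_like(WEIGHTS)  # placed in constant memory at compile time
    i = cuda.grid(1)
    if 0 < i < x.size - 1:
        out[i] = w[0] * x[i - 1] + w[1] * x[i] + w[2] * x[i + 1]
```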
- Deallocation Behavior
https://numba.pydata.org/numba-doc/dev/cuda/external-memory.html#cuda-emm-plugin
CUDA | DPPY |
---|---|
numba.cuda.defer_cleanup | - |
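A sketch of the deferred-deallocation context manager described at the link above:

```python
import numpy as np
from numba import cuda

with cuda.defer_cleanup():
    for _ in range(10):
        d_a = cuda.to_device(np.empty(2 ** 20, dtype=np.float32))
        # ... launch kernels using d_a ...
        del d_a  # deallocation is deferred until the context manager exits
```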