
Gap analysis for Memory Management #151

Closed
@PokhodenkoSA

Description


Study the Memory Management section and document, with examples, the equivalent features in numba-dppy. Identify missing features, e.g. device arrays.

  • Data transfer

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.device_array` | - |
| `numba.cuda.device_array_like` | - |
| `numba.cuda.to_device` | - |
| `numba.cuda.as_cuda_array` (creates a `DeviceNDArray` from any object that implements the CUDA array interface) | - |
| `numba.cuda.is_cuda_array` | - |
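
As a concrete illustration of the gap, this is what explicit data transfer looks like on the CUDA side with Numba's documented API (array sizes are illustrative). numba-dppy has no counterpart yet: kernels receive host NumPy arrays, and transfers are deduced implicitly by the SYCL runtime.

```python
import numpy as np
from numba import cuda

a = np.arange(10, dtype=np.float32)

d_a = cuda.to_device(a)              # explicit host-to-device copy
d_out = cuda.device_array_like(d_a)  # uninitialized device allocation, same shape/dtype

print(cuda.is_cuda_array(d_a))       # True: d_a implements the CUDA array interface
```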
  • Device arrays in CUDA

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.cudadrv.devicearray.DeviceNDArray` | - |
| `copy_to_host` | - |
| `is_c_contiguous` | - |
| `is_f_contiguous` | - |
| `ravel` | - |
| `reshape` | - |
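
For reference, the `DeviceNDArray` operations listed above, exercised on the CUDA side (all calls are from Numba's public API; the shapes are illustrative):

```python
import numpy as np
from numba import cuda

d_a = cuda.to_device(np.arange(12, dtype=np.float32))  # returns a DeviceNDArray

print(d_a.is_c_contiguous())  # True for a freshly copied C-ordered array
d_m = d_a.reshape(3, 4)       # reshape on the device, no host round-trip
d_f = d_m.ravel()             # flatten back to 1-D (contiguous arrays only)

host = d_f.copy_to_host()     # explicit device-to-host copy
```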
  • Pinned memory in CUDA / no explicit mechanism to request pinned memory in SYCL

There are generally two ways in which host memory can be allocated:

- When not using the `cl::sycl::property::buffer::use_host_pointer` property, the SYCL runtime allocates host memory when required. This uses an implementation-specific mechanism, which can attempt to request pinned memory.
- If the `cl::sycl::property::buffer::use_host_pointer` property is used, the SYCL runtime will not allocate host memory and will use the pointer provided when the buffer is constructed. In this case, it is the user's responsibility to ensure that any requirements the allocation must meet for pinned memory are satisfied.

Users can manually allocate pinned memory on the host and hand it over to the SYCL implementation. This often involves allocating host memory with a suitable alignment and size multiple, and can sometimes be managed manually using OS-specific operations such as `mmap` and `munmap`.

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.pinned` | - |
| `numba.cuda.pinned_array` | - |
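
The CUDA pinned-memory API for comparison; per the notes above, SYCL exposes no explicit equivalent, hence the empty DPPY column. A minimal sketch:

```python
import numpy as np
from numba import cuda

# Allocate a page-locked (pinned) host array outright.
p = cuda.pinned_array(1000, dtype=np.float32)
p[:] = 1.0

# Or temporarily pin an existing host array to enable fast async transfers.
a = np.zeros(1000, dtype=np.float32)
stream = cuda.stream()
with cuda.pinned(a):
    d_a = cuda.to_device(a, stream=stream)
stream.synchronize()
```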
  • Streams in CUDA / Queue in SYCL

In a similar fashion to CUDA streams, SYCL queues submit command groups for execution asynchronously. However, SYCL is a higher-level programming model, and data-transfer operations are implicitly deduced from the dependencies of the kernels submitted to a queue. Furthermore, a SYCL queue can map to multiple OpenCL queues, enabling transparent overlapping of data transfer and kernel execution. The SYCL runtime automatically handles the execution order of the different command groups (kernel + dependencies) across multiple queues on different devices.

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.stream` | queue |
| `numba.cuda.default_stream` | - |
| `numba.cuda.legacy_default_stream` | - |
| `numba.cuda.per_thread_default_stream` | - |
| `numba.cuda.external_stream` | - |
| `numba.cuda.cudadrv.driver.Stream` | - |
| `auto_synchronize` | - |
| `synchronize` | event |
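
A sketch of the CUDA stream API from Numba's documentation. On the DPPY side there is no user-visible stream object; as an assumption about the numba-dppy API at the time of writing, the target queue is selected with `dpctl.device_context` and ordering is handled by the runtime through events.

```python
import numpy as np
from numba import cuda

a = np.arange(10, dtype=np.float32)

stream = cuda.stream()
with stream.auto_synchronize():            # synchronizes the stream on exit
    d_a = cuda.to_device(a, stream=stream)   # asynchronous copy on this stream
    d_a.copy_to_host(a, stream=stream)       # asynchronous copy back
```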
  • Per-block shared memory and thread synchronization in CUDA / Local memory in SYCL

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.shared.array` | `dppy.local.static_alloc` |
| `numba.cuda.syncthreads` | `dppy.barrier` |
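
This row is one of the few with a working DPPY equivalent. The sketch below mirrors `numba.cuda.shared.array` + `numba.cuda.syncthreads` using the numba-dppy names from the table; the `@dppy.kernel` decorator, `dppy.get_global_id`, `dppy.CLK_LOCAL_MEM_FENCE`, and the `dpctl.device_context` launch idiom are taken from the project's examples and should be treated as assumptions, not verified API:

```python
import numpy as np
from numba import float32
import numba_dppy as dppy
import dpctl

@dppy.kernel
def reverse(a):
    # SYCL local memory: shared by the work-items of one work-group,
    # the analogue of CUDA per-block shared memory.
    lm = dppy.local.static_alloc(64, float32)
    i = dppy.get_global_id(0)
    lm[i] = a[i]
    dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE)  # cf. cuda.syncthreads()
    a[i] = lm[63 - i]

a = np.arange(64, dtype=np.float32)
with dpctl.device_context("opencl:gpu"):
    reverse[64, 64](a)  # [global size, local size]
```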
  • Per-thread local memory in CUDA / Private memory in SYCL

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.local.array` | - |
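
For comparison, per-thread local memory in Numba's CUDA target; each thread gets a private scratch array invisible to every other thread (the window-sum computation is just an illustration):

```python
from numba import cuda, float32

@cuda.jit
def window_sum(a, out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        # Private per-thread scratch space.
        tmp = cuda.local.array(3, float32)
        for j in range(3):
            tmp[j] = a[i + j]
        out[i] = tmp[0] + tmp[1] + tmp[2]
```

This assumes `len(a) == len(out) + 2` when launched, e.g. `window_sum[blocks, threads](d_a, d_out)`.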
  • Constant memory in CUDA / Constant memory in SYCL

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.const.array_like` | - |
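
And the CUDA constant-memory idiom: `numba.cuda.const.array_like` copies a host array that is known at compile time into read-only constant memory (the smoothing coefficients are illustrative):

```python
import numpy as np
from numba import cuda

COEFFS = np.array([0.25, 0.5, 0.25], dtype=np.float32)

@cuda.jit
def smooth(a, out):
    # COEFFS is captured at compile time and placed in constant memory.
    c = cuda.const.array_like(COEFFS)
    i = cuda.grid(1)
    if i >= 1 and i < a.shape[0] - 1:
        out[i] = c[0] * a[i - 1] + c[1] * a[i] + c[2] * a[i + 1]
```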
  • Deallocation behavior

https://numba.pydata.org/numba-doc/dev/cuda/external-memory.html#cuda-emm-plugin

| CUDA | DPPY |
| --- | --- |
| `numba.cuda.defer_cleanup` | - |
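
The linked page describes Numba's deallocation batching; `numba.cuda.defer_cleanup` postpones the actual frees until the context exits, which matters when interoperating with external memory managers. A minimal sketch:

```python
import numpy as np
from numba import cuda

with cuda.defer_cleanup():
    for _ in range(10):
        d_a = cuda.to_device(np.ones(1024, dtype=np.float32))
        del d_a  # the device memory is not released yet
# all deferred deallocations are performed here
```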
