Description
High-level objective:
- Provide users familiar with `numba.cuda` an easy guide to start using `numba-dppy`. At the end of this gap analysis we should be able to provide use cases showing how a `numba.cuda` program can be translated into a `numba-dppy` program (see the sketch after this list).
- Identify features that are supported in `numba.cuda` but are not yet supported by `numba-dppy`.
- Open separate tickets to track the design of the missing features in `numba-dppy`.
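As a first taste of the kind of transliteration the guide should demonstrate, here is a minimal sketch of the same vector-add kernel in both APIs. The `numba-dppy` half assumes the current `@dppy.kernel`, `dppy.get_global_id`, `dppy.DEFAULT_LOCAL_SIZE`, and `dpctl.device_context` spellings; treat it as illustrative, not normative.

```python
import numpy as np

# numba.cuda version
from numba import cuda

@cuda.jit
def vec_add_cuda(a, b, c):
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] + b[i]

# numba-dppy version (assumed API)
import numba_dppy as dppy
import dpctl

@dppy.kernel
def vec_add_dppy(a, b, c):
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]

a = np.arange(1024, dtype=np.float32)
b = np.ones_like(a)
c = np.empty_like(a)

vec_add_cuda[4, 256](a, b, c)  # CUDA: grid/block launch configuration

with dpctl.device_context("opencl:gpu"):  # dppy: device chosen via context
    vec_add_dppy[1024, dppy.DEFAULT_LOCAL_SIZE](a, b, c)  # global/local NDRange
```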
Detailed Goals
Examples of documentation to orient towards:
- Produce a guide and documentation similar to https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers. We should eventually include the documentation in a future gh-pages site for numba-dppy.
- Produce a guide for numba-dppy similar to https://numba.pydata.org/numba-doc/dev/cuda/index.html
Sections to analyze:
- Study the Writing a CUDA Kernel section and document with examples the equivalent features in `numba-dppy`. (Gaps in writing kernel features #157)
- Study the Memory Management section and document with examples the equivalent features in `numba-dppy`. Identify missing features, e.g. device arrays. (Gap analysis for Memory Management #151)
- Study the Writing Device Functions section and document the equivalent feature in `numba-dppy`. (Gaps in Writing Device Functions #152)
- Evaluate whether the Supported Python features in CUDA Python are currently supported by `numba-dppy`. (Gaps in supported Python features in CUDA Python #155)
- Evaluate whether the Supported Atomic Operations are currently supported by `numba-dppy`. (Gaps in supported Atomics Operation #156)
- Evaluate the RNG feature supported by `numba.cuda` and develop a plan for supporting similar functionality in `dppy.kernel`. (Gaps in RNG function supported by numba_dpex.kernel v/s numba.cuda #159)
- Evaluate the debugging features supported by `numba.cuda`. We probably do not need a simulator feature, since for us a `dppy.kernel` can be debugged by changing the `dpctl.device_context` to CPU. But the rest of the debugging functionality should be evaluated. (Gaps in Debugging features #158)
- Evaluate the GPU reduction features provided by `numba.cuda`. We currently do not have anything similar, and the output of this step should be a design to support a similar `@reduce` decorator in `numba-dppy` (see the `@cuda.reduce` sketch after this list). (Gaps in GPU Reduction #153)
- Evaluate the level of support for the Vectorize and GUVectorize functions in `numba.cuda`. We do support `@vectorize`, but support for `@guvectorize` is missing. (Gaps in Vectroize and GUVectorize #154)
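For the reduction item above, the feature being matched is `numba.cuda`'s `@cuda.reduce` decorator. As a reminder of what it looks like on the CUDA side:

```python
import numpy as np
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    # Only the binary operation is written by the user;
    # numba.cuda builds the full GPU reduction around it.
    return a + b

A = np.arange(1024, dtype=np.float64)
expect = A.sum()     # NumPy reduction on the host
got = sum_reduce(A)  # equivalent reduction run on the GPU
assert expect == got
```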
Other topics:
- Evaluate how to support pipelined asynchronous execution of GPU kernels to overlap compute and host-device data movement; a sketch of the pattern follows below. Refer to the example in the following comment. (Add support for pipelined GPU kernel execution to numba-dpex #147)
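The `numba.cuda` pattern in question looks roughly like the following sketch: chunked transfers and kernel launches are queued on independent streams so copies and compute can overlap. Pinned host memory is required for the copies to actually be asynchronous.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

nchunks, chunk = 4, 1 << 18
host = cuda.pinned_array((nchunks, chunk), dtype=np.float32)  # pinned for async copies
host[:] = 1.0
streams = [cuda.stream() for _ in range(nchunks)]

for s, h in zip(streams, host):
    d = cuda.to_device(h, stream=s)      # H2D copy queued on this stream
    scale[chunk // 256, 256, s](d, 2.0)  # kernel queued behind the copy
    d.copy_to_host(h, stream=s)          # D2H copy queued behind the kernel
cuda.synchronize()                       # wait for all streams to drain
```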
The goal of this exercise is to identify the features that are missing and need to be added, and to develop a guide that makes it easier for users to start using `numba-dppy`.
Not supported topics:
Not all features provided by `numba.cuda` are relevant or necessary for `numba-dppy`. An example is the support for NumPy functions inside kernels: CUDA supports it, but `numba-dppy` should not, as it is really an anti-pattern (#146).
Topics for dpctl:
Some of the sections from the Numba for CUDA GPUs documentation are not relevant to `numba-dppy` and should, in our case, be handled in `dpctl`:
- Device Management (Gaps in Device Management dpctl#240)
- External Memory (Gaps in External Memory dpctl#252)
- Sharing CUDA Memory (Gaps in IPC for Device Memory dpctl#245)
Sources of information:
- Docs
- Tests
- Examples (where are the examples for `numba.cuda`?)
Acceptance criteria for analysis:
- Tickets for missing features
- Comparison/transition guide from `numba.cuda` to `numba-dppy`
- Examples for `numba-dppy`
- Documentation: explanation of the feature
Documentation should contain:
- Explanation of the feature
- Examples for `numba-dppy`
- Missing features in `numba-dppy`
- Transition from `numba.cuda` to `numba-dppy`
- Limitations of `numba-dppy`
Missing features
- Memory management:
  - Data transfer: `device_array`, `device_array_like`, `to_device`, `as_dppy_array`, `is_dppy_array` (It is necessary to implement the elements of the data transfer #162)
  - Device array (Implement mechanism of the Device arrays #163)
  - Local memory (Implement SYCL local memory #164)
  - Private memory (Implement SYCL Private memory #165)
  - Constant memory (Implement SYCL Constant memory #166)
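  For orientation, a minimal sketch of the `numba.cuda` data-transfer idiom that these tickets would mirror:

  ```python
  import numpy as np
  from numba import cuda

  a = np.arange(10, dtype=np.float32)
  d_a = cuda.to_device(a)              # explicit host-to-device transfer
  d_out = cuda.device_array_like(d_a)  # uninitialized device-resident array
  # ... launch kernels on d_a / d_out ...
  result = d_out.copy_to_host()        # explicit device-to-host transfer
  print(cuda.is_cuda_array(d_a))       # True: d_a lives on the device
  ```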
- Writing device functions:
  - `dppy.jit` (Rename dppy.kernel and dppy.func to dppy.jit #189)
  - Call from `@numba.jit` (Support for Calling Device Functions from numba.jit functions #205)
  - Call from `@vectorize` (Support for Calling Device Functions from Ufuncs #193)
- Supported Python features in CUDA Python:
  - Statements: raise, assert (Make python statements available in the kernel #169)
  - Built-in types: complex, bool, None, tuple (Implement support for types complex, bool, None, tuple on the kernel #170)
  - Built-in functions: complex, enumerate, min, max, zip (Implement support for built-in python functions on the kernel #171)
  - cmath library (Make the cmath library working on the kernel #172)
  - Operators: &, &=, <<=, ~=, |=, >>=, ^=, >>, ^ (Add support for all operators in kernels #178)
- Supported atomic operations:
  - min, max, nanmin, nanmax, compare_and_swap for int, float, and uint types (Add support for atomic operations inside a kernel #161)
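  The `numba.cuda` counterpart for reference (`cuda.atomic.max` shown; the ticket tracks the equivalent operations for `numba-dppy`):

  ```python
  import numpy as np
  from numba import cuda

  @cuda.jit
  def max_kernel(result, values):
      i = cuda.grid(1)
      if i < values.size:
          cuda.atomic.max(result, 0, values[i])  # atomic read-modify-write

  values = np.random.rand(16384).astype(np.float64)
  result = np.zeros(1, dtype=np.float64)
  max_kernel[64, 256](result, values)
  assert result[0] == values.max()
  ```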
- Random number generation:
  - RNG in `dppy.kernel` (Random number generation on GPU #202)
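  For reference, the `numba.cuda` RNG feature being evaluated:

  ```python
  import numpy as np
  from numba import cuda
  from numba.cuda.random import (create_xoroshiro128p_states,
                                 xoroshiro128p_uniform_float32)

  @cuda.jit
  def fill_uniform(states, out):
      i = cuda.grid(1)
      if i < out.size:
          out[i] = xoroshiro128p_uniform_float32(states, i)  # per-thread stream

  threads, blocks = 256, 64
  states = create_xoroshiro128p_states(threads * blocks, seed=1)
  out = np.zeros(threads * blocks, dtype=np.float32)
  fill_uniform[blocks, threads](states, out)
  ```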
- Debugging:
  - `@dppy.kernel(debug=True)` (Enable debug option inside dppy.kernel #174)
  - pdb (Implement changing the dpctl.device_context to CPU while debugging #198)
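  A minimal sketch of the CPU-context debugging idea from #198, assuming the `dpctl.device_context` filter-string API:

  ```python
  import numpy as np
  import dpctl
  import numba_dppy as dppy

  @dppy.kernel
  def twice(a):
      i = dppy.get_global_id(0)
      a[i] *= 2

  a = np.arange(8, dtype=np.float32)
  # Re-running the same kernel under a CPU queue lets standard host-side
  # tools (e.g. pdb, printf-style checks) be used while debugging.
  with dpctl.device_context("opencl:cpu"):
      twice[a.size, dppy.DEFAULT_LOCAL_SIZE](a)
  ```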
- GPU reduction:
  - `@reduce` (@reduce decorator #182)
- Vectorize and GUVectorize:
  - `@guvectorize` (Support @guvectorize #192)
  - Calling device functions (Support for Calling Device Functions from Ufuncs #193)
  - Intra-device arrays (Support for passing intra-device arrays to Ufuncs #194)
  - Asynchronous launching (Support for launching Ufuncs asynchronously #195)
  - Control of thread block size (Support for explicit control maximum size of the thread block for Ufuncs #196)
  - Offload diagnostics (SYCL level diagnostics for offloading #207)
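  For reference, the `numba.cuda` form of the missing `@guvectorize` support:

  ```python
  import numpy as np
  from numba import guvectorize

  # target='cuda' compiles the generalized ufunc for the GPU; the dppy
  # equivalent is what issue #192 tracks.
  @guvectorize(['void(float32[:], float32[:], float32[:])'],
               '(n),(n)->(n)', target='cuda')
  def vec_add(a, b, out):
      for i in range(a.shape[0]):
          out[i] = a[i] + b[i]

  a = np.arange(1024, dtype=np.float32)
  b = np.ones_like(a)
  print(vec_add(a, b)[:4])
  ```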
- Device management:
  - Multi-GPU machines (Support for multi-GPU machines (sort devices by performance) dpctl#246)
  - Functions for selecting device (Functions for selecting current device dpctl#247)
  - Device list (Functions for enumerating devices dpctl#248)
- External memory management:
  - Implementing an EMM Plugin (dpctl#254)
  - Implement plugin for EMM Deallocation Behavior (dpctl#253)
  - Implement The Host-Only Memory Manager (dpctl#255)
  - Implement Memory Pointers (#699)
  - Implement Memory Info (dpctl#257)
  - Implement IPC (dpctl#258)
- Sharing device memory:
  - IPC for Device Memory (Support IPC for Device Memory dpctl#249)
- Pipelined asynchronous execution of GPU kernels to overlap compute and host-device data movement:
  - Pipelined execution (Add support for pipelined GPU kernel execution to numba-dpex #147)
Missing examples
- Reduction:
  - Sum reduction in one call (Example for sum reduction with local memory and barriers #186); a reference sketch follows below
- RNG:
  - RNG via dpNP (Example for RNG on GPU #203)
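For the missing sum-reduction example (#186), the `numba.cuda` idiom it would mirror combines block-local (shared) memory with barriers. A sketch:

```python
import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block; frozen into the kernel as a constant

@cuda.jit
def block_sum(x, partial):
    sm = cuda.shared.array(TPB, dtype=float32)  # block-local scratch memory
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    sm[tid] = x[i] if i < x.size else 0.0
    cuda.syncthreads()                          # barrier: all loads visible
    s = TPB // 2
    while s > 0:                                # tree reduction within a block
        if tid < s:
            sm[tid] += sm[tid + s]
        cuda.syncthreads()
        s //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = sm[0]        # one partial sum per block

x = np.ones(1 << 20, dtype=np.float32)
blocks = (x.size + TPB - 1) // TPB
partial = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, TPB](x, partial)
assert partial.sum() == x.size
```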