Description
"Evaluate the GPU Reduction features that are provided by numba.cuda
We currently do not have anything similar, and the output of this step should be a design for a similar @reduce decorator for numba-dppy.
Links:
- https://github.com/IntelPython/numba-dppy/blob/main/HowTo.rst#reduction
- https://github.com/IntelPython/numba-dppy/blob/main/numba_dppy/examples/sum_reduction.py
Related issues:
- Numba Reduction operations (+=, *=) not supported on GPU #71
- Numba-DPPY assumes reduction when there is not reduction present. #78
- Reductions operations unsupported error in L2 distance #96
- Support automatically offloading of NumPy reduction operations and prange reductions to SYCL devices #93
Features:
1. Reduction kernels (convert a simple binary operation into a reduction kernel)

    @cuda.reduce
    def sum_reduce(a, b):
        return a + b

    res = sum_reduce(arr)

numba-dppy does not provide a decorator for reduction.
From the HowTo: "This can be implemented by invoking the kernel once, but that requires support for local device memory and barrier, which is a work in progress."
See the linked sum_reduction.py example; a condensed sketch follows.
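For reference, a condensed sketch of the by-hand approach, modeled on the linked sum_reduction.py: the binary operation is baked into a kernel that sums pairs of elements, and the host relaunches the kernel until one element remains. The API names used here (numba_dppy.kernel, get_global_id, DEFAULT_LOCAL_SIZE, dpctl.device_context) are taken from that example and may differ across numba-dppy versions; treat this as an illustration of the boilerplate, not exact code.

    import numpy as np
    import dpctl
    import numba_dppy as dppy

    @dppy.kernel
    def sum_reduction_kernel(A, R, stride):
        i = dppy.get_global_id(0)
        # Pairwise sum: combine element i with element i + stride
        R[i] = A[i] + A[i + stride]
        # Write the partial sum back so the next pass can reuse A
        A[i] = R[i]

    def sum_reduce(A):
        # Assumes len(A) is a power of two; the kernel mutates A in place
        total = A.shape[0]
        R = np.empty(total // 2, dtype=A.dtype)
        with dpctl.device_context("opencl:gpu"):
            while total > 1:
                global_size = total // 2
                sum_reduction_kernel[global_size, dppy.DEFAULT_LOCAL_SIZE](A, R, global_size)
                total = global_size
        return A[0]

    arr = np.arange(1024, dtype=np.float64)
    print(sum_reduce(arr))  # 523776.0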
2. Support lambdas

    sum_reduce = cuda.reduce(lambda a, b: a + b)

numba-dppy: No.
3. Possible parameters
| Parameter | CUDA | DPPY |
|---|---|---|
| Works with host and device arrays | Yes | ? |
| Size of array | Yes | ? |
| Return value or output parameter | Yes | ? |
| Initial value | Yes | ? |
| Pin to stream | Yes | Pin to queue? |
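For context, the CUDA column corresponds to the arguments accepted by the callable that @cuda.reduce returns in numba.cuda (size, res, init, stream); a short usage sketch, assuming a CUDA-capable device:

    import numpy as np
    from numba import cuda

    @cuda.reduce
    def sum_reduce(a, b):
        return a + b

    A = np.arange(1234, dtype=np.float64)
    d_A = cuda.to_device(A)

    full = sum_reduce(A)                 # host array in, reduced value returned
    also = sum_reduce(d_A)               # device arrays work as well
    part = sum_reduce(A, size=100)       # reduce only the first 100 elements
    seed = sum_reduce(A, init=10.0)      # start the reduction from an initial value
    out = cuda.device_array(1, dtype=np.float64)
    sum_reduce(d_A, res=out)             # write the result into a device array
    stream = cuda.stream()
    sum_reduce(d_A, res=out, stream=stream)  # pin the reduction to a CUDA stream

For numba-dppy, the stream parameter would presumably map onto a dpctl/SYCL queue, which is what the "Pin to queue?" cell refers to.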
Side gap:
- Support for local device memory and barriers (needed for a single-kernel reduction)
- Local device memory and barriers could be supported; see sum_reduction_ocl.py#L18
Questions:
- Are local device memory and barrier supported now?
- Does ParallelFor device offloading support reductions? (Numba Reduction operations (+=, *=) not supported on GPU #71)
We provide all the features needed to write reductions by hand, but we should also provide a @reduce decorator. Writing a reduction by hand is harder; @reduce would autogenerate the boilerplate code. A hypothetical design sketch follows.
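This is a purely hypothetical design sketch, not existing numba-dppy API: the names reduce, Reduce, and dppy.func (a device-function decorator assumed to compile the user's binary operation) are all assumptions. It mirrors the numba.cuda interface (size, init) on the host side while wrapping the multi-pass kernel pattern shown earlier.

    import numpy as np
    import dpctl
    import numba_dppy as dppy  # all numba_dppy names below are assumptions, see above

    class Reduce:
        """Hypothetical wrapper that turns a binary function into a reduction callable."""

        def __init__(self, binop):
            self._binop = binop
            # Assumed device-function decorator, analogous to cuda.jit(device=True)
            device_binop = dppy.func(binop)

            @dppy.kernel
            def _step(A, R, stride):
                i = dppy.get_global_id(0)
                R[i] = device_binop(A[i], A[i + stride])
                A[i] = R[i]

            self._step = _step

        def __call__(self, arr, size=None, init=None, queue="opencl:gpu"):
            n = arr.shape[0] if size is None else size
            A = np.array(arr[:n], copy=True)        # the kernel mutates its input
            R = np.empty(max(n // 2, 1), dtype=A.dtype)
            with dpctl.device_context(queue):
                # Multi-pass tree reduction; a single-pass version would need
                # local device memory and barriers (see the side gap above).
                while n > 1:
                    gs = n // 2
                    self._step[gs, dppy.DEFAULT_LOCAL_SIZE](A, R, gs)
                    n = gs
            return A[0] if init is None else self._binop(init, A[0])

    def reduce(binop):
        """Hypothetical @reduce decorator; accepts named functions and lambdas alike."""
        return Reduce(binop)

    @reduce
    def sum_reduce(a, b):
        return a + b

    prod_reduce = reduce(lambda a, b: a * b)   # lambda support falls out for free

    res = sum_reduce(np.arange(1024, dtype=np.float64))

A real implementation would also need to handle non-power-of-two sizes, an output parameter (res), and pinning to a specific dpctl queue, per the parameter table above.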
Another missing feature is support for parfor reductions (see the example below); we should wait for MLIR support for this.
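For reference, the pattern the related issues describe is a prange loop with a += accumulator, shown here as a plain njit(parallel=True) function on the CPU (the L2 distance case from #96); the missing piece is offloading this parfor reduction to a SYCL device automatically.

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def l2_distance_sq(a, b):
        total = 0.0
        for i in prange(a.shape[0]):
            total += (a[i] - b[i]) ** 2   # parfor reduction (+=) on `total`
        return total

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    print(l2_distance_sq(a, b))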
Missing features:
- @reduce decorator (including lambda support) to autogenerate reduction boilerplate
- Single-kernel reduction using local device memory and barriers
- Automatic offload of parfor/prange reductions (blocked on MLIR support)
Missing example: a reduction written with a @reduce decorator in numba-dppy (counterpart of the CUDA sum_reduce example above).