Description
"Evaluate the GPU Reduction features that are provided by numba.cuda
We currently do not have anything similar, and the output of this step should be a design for a similar @reduce decorator for numba-dppy.
Links:
- https://github.com/IntelPython/numba-dppy/blob/main/HowTo.rst#reduction
- https://github.com/IntelPython/numba-dppy/blob/main/numba_dppy/examples/sum_reduction.py
Related issues:
- Numba Reduction operations (+=, *=) not supported on GPU #71
- Numba-DPPY assumes reduction when there is not reduction present. #78
- Reductions operations unsupported error in L2 distance #96
- Support automatically offloading of NumPy reduction operations and prange reductions to SYCL devices #93
Features:
1. Reduction kernels (convert a simple binary operation into a reduction kernel)

    @cuda.reduce
    def sum_reduce(a, b):
        return a + b

    res = sum_reduce(arr)

numba-dppy does not provide a decorator for reduction.
From the HowTo: "This can be implemented by invoking the kernel once, but that requires support for local device memory and barrier, which is a work in progress."
See the linked sum_reduction.py example; a condensed sketch follows.
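For reference, a condensed sketch of the by-hand approach, modeled on the linked sum_reduction.py: the binary operation is baked into a kernel that sums pairs of elements, and the host relaunches the kernel until one element remains. The API names used here (numba_dppy.kernel, get_global_id, DEFAULT_LOCAL_SIZE, dpctl.device_context) are taken from that example and may differ across numba-dppy versions; treat this as an illustration of the boilerplate, not exact code.

    import numpy as np
    import dpctl
    import numba_dppy as dppy

    @dppy.kernel
    def sum_reduction_kernel(A, R, stride):
        i = dppy.get_global_id(0)
        # Pairwise sum: combine element i with element i + stride
        R[i] = A[i] + A[i + stride]
        # Write the partial sum back so the next pass can reuse A
        A[i] = R[i]

    def sum_reduce(A):
        # Assumes len(A) is a power of two; the kernel mutates A in place
        total = A.shape[0]
        R = np.empty(total // 2, dtype=A.dtype)
        with dpctl.device_context("opencl:gpu"):
            while total > 1:
                global_size = total // 2
                sum_reduction_kernel[global_size, dppy.DEFAULT_LOCAL_SIZE](A, R, global_size)
                total = global_size
        return A[0]

    arr = np.arange(1024, dtype=np.float64)
    print(sum_reduce(arr))  # 523776.0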
2. Support lambdas

    sum_reduce = cuda.reduce(lambda a, b: a + b)

numba-dppy: No.
3. Possible parameters
| Parameter | CUDA | DPPY |
|---|---|---|
| Works with host and device arrays | Yes | ? |
| Size of array | Yes | ? |
| Return value or output parameter | Yes | ? |
| Initial value | Yes | ? |
| Pin to stream | Yes | Pin to queue? |
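For context, the CUDA column corresponds to the arguments accepted by the callable that @cuda.reduce returns in numba.cuda (size, res, init, stream); a short usage sketch, assuming a CUDA-capable device:

    import numpy as np
    from numba import cuda

    @cuda.reduce
    def sum_reduce(a, b):
        return a + b

    A = np.arange(1234, dtype=np.float64)
    d_A = cuda.to_device(A)

    full = sum_reduce(A)                 # host array in, reduced value returned
    also = sum_reduce(d_A)               # device arrays work as well
    part = sum_reduce(A, size=100)       # reduce only the first 100 elements
    seed = sum_reduce(A, init=10.0)      # start the reduction from an initial value
    out = cuda.device_array(1, dtype=np.float64)
    sum_reduce(d_A, res=out)             # write the result into a device array
    stream = cuda.stream()
    sum_reduce(d_A, res=out, stream=stream)  # pin the reduction to a CUDA stream

For numba-dppy, the stream parameter would presumably map onto a dpctl/SYCL queue, which is what the "Pin to queue?" cell refers to.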
Side gap:
- Support for local device memory and barriers (needed for a single-kernel reduction)
- Local device memory and barriers could be supported; see sum_reduction_ocl.py#L18
Questions:
- Are local device memory and barrier supported now?
- Does ParallelFor device offloading support reductions? (Numba Reduction operations (+=, *=) not supported on GPU #71)
We provide all the features needed to write reductions by hand, but we should also provide a @reduce decorator. Writing a reduction by hand is harder; @reduce would autogenerate the boilerplate code. A hypothetical design sketch follows.
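This is a purely hypothetical design sketch, not existing numba-dppy API: the names reduce, Reduce, and dppy.func (a device-function decorator assumed to compile the user's binary operation) are all assumptions. It mirrors the numba.cuda interface (size, init) on the host side while wrapping the multi-pass kernel pattern shown earlier.

    import numpy as np
    import dpctl
    import numba_dppy as dppy  # all numba_dppy names below are assumptions, see above

    class Reduce:
        """Hypothetical wrapper that turns a binary function into a reduction callable."""

        def __init__(self, binop):
            self._binop = binop
            # Assumed device-function decorator, analogous to cuda.jit(device=True)
            device_binop = dppy.func(binop)

            @dppy.kernel
            def _step(A, R, stride):
                i = dppy.get_global_id(0)
                R[i] = device_binop(A[i], A[i + stride])
                A[i] = R[i]

            self._step = _step

        def __call__(self, arr, size=None, init=None, queue="opencl:gpu"):
            n = arr.shape[0] if size is None else size
            A = np.array(arr[:n], copy=True)        # the kernel mutates its input
            R = np.empty(max(n // 2, 1), dtype=A.dtype)
            with dpctl.device_context(queue):
                # Multi-pass tree reduction; a single-pass version would need
                # local device memory and barriers (see the side gap above).
                while n > 1:
                    gs = n // 2
                    self._step[gs, dppy.DEFAULT_LOCAL_SIZE](A, R, gs)
                    n = gs
            return A[0] if init is None else self._binop(init, A[0])

    def reduce(binop):
        """Hypothetical @reduce decorator; accepts named functions and lambdas alike."""
        return Reduce(binop)

    @reduce
    def sum_reduce(a, b):
        return a + b

    prod_reduce = reduce(lambda a, b: a * b)   # lambda support falls out for free

    res = sum_reduce(np.arange(1024, dtype=np.float64))

A real implementation would also need to handle non-power-of-two sizes, an output parameter (res), and pinning to a specific dpctl queue, per the parameter table above.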
Another missing feature is support for parfor reductions (see the example below); we should wait for MLIR support for this.
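For reference, the pattern the related issues describe is a prange loop with a += accumulator, shown here as a plain njit(parallel=True) function on the CPU (the L2 distance case from #96); the missing piece is offloading this parfor reduction to a SYCL device automatically.

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def l2_distance_sq(a, b):
        total = 0.0
        for i in prange(a.shape[0]):
            total += (a[i] - b[i]) ** 2   # parfor reduction (+=) on `total`
        return total

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    print(l2_distance_sq(a, b))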
Missing features:
- @reduce decorator (including lambda support) to autogenerate reduction boilerplate
- Single-kernel reduction using local device memory and barriers
- Automatic offload of parfor/prange reductions (blocked on MLIR support)
Missing example: a reduction written with a @reduce decorator in numba-dppy (counterpart of the CUDA sum_reduce example above).