Skip to content

Commit 7c6906a

Browse files
author
Diptorup Deb
committed
Restructure and fill in the kernel programming guide.
1 parent fb5dd6d commit 7c6906a

File tree

9 files changed

+372
-220
lines changed

9 files changed

+372
-220
lines changed

docs/source/ext_links.txt

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,3 +31,6 @@
3131
.. _oneAPI GPU optimization guide: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/general-purpose-computing-on-gpu.html
3232
.. _dpctl.tensor.usm_ndarray: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/usm_ndarray.html#dpctl.tensor.usm_ndarray
3333
.. _dpnp.ndarray: https://intelpython.github.io/dpnp/reference/ndarray.html
34+
35+
.. _Dispatcher: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#dispatcher-objects
36+
.. _Unboxes: https://numba.readthedocs.io/en/stable/extending/interval-example.html#boxing-and-unboxing
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
.. _launching-an-async-kernel:
2+
3+
Async kernel execution
4+
======================
Lines changed: 117 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,117 @@
1+
.. _launching-a-kernel:
2+
3+
Launching a kernel
4+
==================
5+
6+
A ``kernel`` decorated kapi function produces a ``KernelDispatcher`` object that
7+
is a type of a Numba* `Dispatcher`_ object. However, unlike regular Numba*
8+
Dispatcher objects a ``KernelDispatcher`` object cannot be directly invoked from
9+
either CPython or another compiled Numba* ``jit`` function. To invoke a
10+
``kernel`` decorated function, a programmer has to use the
11+
:func:`numba_dpex.core.kernel_launcher.call_kernel` function.
12+
13+
To invoke a ``KernelDispatcher`` the ``call_kernel`` function requires three
14+
things: the ``KernelDispatcher`` object, the ``Range`` or ``NdRange`` object
15+
over which the kernel is to be executed, and the list of arguments to be passed
16+
to the compiled kernel. Once called with the necessary arguments, the
17+
``call_kernel`` function does the following main things:
18+
19+
- Compiles the ``KernelDispatcher`` object specializing it for the provided
20+
argument types.
21+
22+
- `Unboxes`_ the kernel arguments by converting CPython objects into Numba* or
23+
numba-dpex objects.
24+
25+
- Infer the execution queue on which to submit the kernel from the provided
26+
kernel arguments. (TODO: Refer compute follows data.)
27+
28+
- Submits the kernel to the execution queue.
29+
30+
- Waits for the execution completion, before returning control back to the
31+
caller.
32+
33+
.. important::
34+
Programmers should note the following two things when defining the global or
35+
local range to launch a kernel.
36+
37+
* Numba-dpex currently limits the maximum allowed global range size to
38+
``2^31-1``. It is due to the capabilities of current OpenCL GPU backends
39+
that generally do not support more than 32-bit global range sizes. A
40+
kernel requesting a larger global range than that will not execute and a
41+
``dpctl._sycl_queue.SyclKernelSubmitError`` will get raised.
42+
43+
The Intel dpcpp SYCL compiler does handle greater than 32-bit global
44+
ranges for GPU backends by wrapping the kernel in a new kernel that has
45+
each work-item perform multiple invocations of the original kernel in a
46+
32-bit global range. Such a feature is not yet available in numba-dpex.
47+
48+
* When launching an nd-range kernel, if the number of work-items for a
49+
particular dimension of a work-group exceeds the maximum device
50+
capability, it can result in undefined behavior.
51+
52+
The maximum allowed work-items for a device can be queried programmatically
53+
as shown in :ref:`ex_max_work_item`.
54+
55+
.. code-block:: python
56+
:linenos:
57+
:caption: **Example:** Query maximum number of work-items for a device
58+
:name: ex_max_work_item
59+
60+
import dpctl
61+
import math
62+
63+
d = dpctl.SyclDevice("gpu")
64+
d.print_device_info()
65+
66+
max_num_work_items = (
67+
d.max_work_group_size
68+
* d.max_work_item_sizes1d[0]
69+
* d.max_work_item_sizes2d[0]
70+
* d.max_work_item_sizes3d[0]
71+
)
72+
print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})")
73+
74+
cpud = dpctl.SyclDevice("cpu")
75+
cpud.print_device_info()
76+
77+
max_num_work_items_cpu = (
78+
cpud.max_work_group_size
79+
* cpud.max_work_item_sizes1d[0]
80+
* cpud.max_work_item_sizes2d[0]
81+
* cpud.max_work_item_sizes3d[0]
82+
)
83+
print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})")
84+
85+
The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated
86+
graphics processor and a 9th Generation Coffee Lake CPU is shown in
87+
:ref:`ex_max_work_item_output`.
88+
89+
.. code-block:: bash
90+
:caption: **OUTPUT:** Query maximum number of work-items for a device
91+
:name: ex_max_work_item_output
92+
93+
Name Intel(R) UHD Graphics 630 [0x3e98]
94+
Driver version 1.3.24595
95+
Vendor Intel(R) Corporation
96+
Filter string level_zero:gpu:0
97+
98+
4294967296 (2^32)
99+
Name Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
100+
Driver version 2023.16.12.0.12_195853.xmain-hotfix
101+
Vendor Intel(R) Corporation
102+
Filter string opencl:cpu:0
103+
104+
4503599627370496 (2^52)
105+
106+
107+
The ``call_kernel`` function can be invoked both from CPython and from another
108+
Numba* compiled function. Note that the ``call_kernel`` function supports only
109+
synchronous execution of kernel and the ``call_kernel_async`` function should be
110+
used for asynchronous mode of kernel execution (refer
111+
:ref:`launching-an-async-kernel`).
112+
113+
114+
.. seealso::
115+
116+
Refer the API documentation for
117+
:func:`numba_dpex.core.kernel_launcher.call_kernel` for more details.

docs/source/user_guide/kernel_programming/device-functions.rst

Lines changed: 32 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,29 +1,33 @@
11
Numba-dpex provides a decorator to express auxiliary device-only functions that
22
can be called from a kernel or another device function, but are not callable
3-
from the host. This decorator :func:`numba_dpex.experimental.device_func` has no
4-
direct analogue in SYCL and primarily is provided to help programmers make their
5-
kapi applications modular.
3+
from the host. This decorator :func:`numba_dpex.core.decorators.device_func` has
4+
no direct analogue in SYCL and primarily is provided to help programmers make
5+
their kapi applications modular. :ref:`ex_device_func1` shows a simple usage of
6+
the ``device_func`` decorator.
67

78
.. code-block:: python
9+
:linenos:
10+
:caption: **Example:** Basic usage of device_func
11+
:name: ex_device_func1
812
913
import dpnp
1014
11-
from numba_dpex import experimental as dpex_exp
15+
import numba_dpex as dpex
1216
from numba_dpex import kernel_api as kapi
1317
1418
# Array size
1519
N = 10
1620
1721
18-
@dpex_exp.device_func
22+
@dpex.device_func
1923
def a_device_function(a):
20-
"""A device callable function that can be invoked from a ``kernel`` or
24+
"""A device callable function that can be invoked from a kernel or
2125
another device function.
2226
"""
2327
return a + 1
2428
2529
26-
@dpex_exp.kernel
30+
@dpex.kernel
2731
def a_kernel_function(item: kapi.Item, a, b):
2832
"""Demonstrates calling a device function from a kernel."""
2933
i = item.get_id(0)
@@ -34,31 +38,45 @@ kapi applications modular.
3438
a = dpnp.ones(N, dtype=dpnp.int32)
3539
b = dpnp.zeros(N, dtype=dpnp.int32)
3640
37-
dpex_exp.call_kernel(a_kernel_function, dpex.Range(N), a, b)
41+
dpex.call_kernel(a_kernel_function, dpex.Range(N), a, b)
3842
3943
40-
@dpex_exp.device_func
41-
def increment_value(nd_item: NdItem, a):
44+
.. code-block:: python
45+
:linenos:
46+
:caption: **Example:** Using kapi functionalities in a device_func
47+
:name: ex_device_func2
48+
49+
import dpnp
50+
51+
import numba_dpex as dpex
52+
from numba_dpex import kernel_api as kapi
53+
54+
55+
@dpex.device_func
56+
def increment_value(nd_item: kapi.NdItem, a):
4257
"""Demonstrates the usage of group_barrier and NdItem usage in a
4358
device_func.
4459
"""
4560
i = nd_item.get_global_id(0)
4661
4762
a[i] += 1
48-
group_barrier(nd_item.get_group(), MemoryScope.DEVICE)
63+
kapi.group_barrier(nd_item.get_group(), kapi.MemoryScope.DEVICE)
4964
5065
if i == 0:
5166
for idx in range(1, a.size):
5267
a[0] += a[idx]
5368
5469
55-
@dpex_exp.kernel
56-
def another_kernel(nd_item: NdItem, a):
70+
@dpex.kernel
71+
def another_kernel(nd_item: kapi.NdItem, a):
5772
"""The kernel does everything by calling a device_func."""
5873
increment_value(nd_item, a)
5974
6075
61-
dpex_exp.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b)
76+
N = 16
77+
b = dpnp.ones(N, dtype=dpnp.int32)
78+
79+
dpex.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b)
6280
6381
6482
A device function does not require the first argument to be an index space id

0 commit comments

Comments
 (0)