|
| 1 | +.. _launching-a-kernel: |
| 2 | + |
| 3 | +Launching a kernel |
| 4 | +================== |
| 5 | + |
| 6 | +A ``kernel``-decorated kapi function produces a ``KernelDispatcher`` object, |
| 7 | +which is a type of Numba* `Dispatcher`_ object. However, unlike regular Numba* |
| 8 | +Dispatcher objects, a ``KernelDispatcher`` object cannot be invoked directly from |
| 9 | +either CPython or another compiled Numba* ``jit`` function. To invoke a |
| 10 | +``kernel``-decorated function, a programmer has to use the |
| 11 | +:func:`numba_dpex.core.kernel_launcher.call_kernel` function. |
| 12 | + |
| 13 | +To invoke a ``KernelDispatcher``, the ``call_kernel`` function requires three |
| 14 | +things: the ``KernelDispatcher`` object, the ``Range`` or ``NdRange`` object |
| 15 | +over which the kernel is to be executed, and the arguments to be passed |
| 16 | +to the compiled kernel. Once called with the necessary arguments, the |
| 17 | +``call_kernel`` function performs the following steps: |
| 18 | + |
| 19 | +- Compiles the ``KernelDispatcher`` object, specializing it for the provided |
| 20 | + argument types. |
| 21 | + |
| 22 | +- `Unboxes`_ the kernel arguments by converting CPython objects into Numba* or |
| 23 | + numba-dpex objects. |
| 24 | + |
| 25 | +- Infers the execution queue on which to submit the kernel from the provided |
| 26 | + kernel arguments. (TODO: Refer to compute follows data.) |
| 27 | + |
| 28 | +- Submits the kernel to the execution queue. |
| 29 | + |
| 30 | +- Waits for the kernel to finish executing before returning control to the |
| 31 | + caller. |
| 32 | + |
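The steps above can be illustrated with a minimal launch. The following is a sketch only: it assumes that numba-dpex and dpnp are installed and that a SYCL device is available, and the kernel name ``vecadd`` and the array names are illustrative choices, not part of the text above.

```python
import dpnp
import numba_dpex as dpex
from numba_dpex import kernel_api as kapi
from numba_dpex.core.kernel_launcher import call_kernel


# Illustrative kernel: each work-item adds one pair of elements.
@dpex.kernel
def vecadd(item: kapi.Item, a, b, c):
    i = item.get_id(0)
    c[i] = a[i] + b[i]


N = 1024
# The arrays are allocated on the default SYCL device; the execution
# queue is inferred from them when the kernel is launched.
a = dpnp.ones(N)
b = dpnp.ones_like(a)
c = dpnp.zeros_like(a)

# Pass the dispatcher, the range, and then the kernel arguments.
# The call returns only after the kernel has finished executing.
call_kernel(vecadd, kapi.Range(N), a, b, c)
```

Because ``call_kernel`` is synchronous, ``c`` can be read as soon as the call returns.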
| 33 | +.. important:: |
| 34 | + Programmers should note the following two things when defining the global or |
| 35 | + local range to launch a kernel. |
| 36 | + |
| 37 | + * Numba-dpex currently limits the maximum allowed global range size to |
| 38 | + ``2^31-1``. This limit exists because current OpenCL GPU backends |
| 39 | + generally do not support more than 32-bit global range sizes. A |
| 40 | + kernel that requests a larger global range will not execute, and a |
| 41 | + ``dpctl._sycl_queue.SyclKernelSubmitError`` is raised. |
| 42 | + |
| 43 | + The Intel dpcpp SYCL compiler does handle greater than 32-bit global |
| 44 | + ranges for GPU backends by wrapping the kernel in a new kernel that has |
| 45 | + each work-item perform multiple invocations of the original kernel in a |
| 46 | + 32-bit global range. Such a feature is not yet available in numba-dpex. |
| 47 | + |
| 48 | + * When launching an nd-range kernel, if the number of work-items for a |
| 49 | + particular dimension of a work-group exceeds the maximum device |
| 50 | + capability, it can result in undefined behavior. |
| 51 | + |
| 52 | + The maximum allowed work-items for a device can be queried programmatically |
| 53 | + as shown in :ref:`ex_max_work_item`. |
| 54 | + |
| 55 | + .. code-block:: python |
| 56 | + :linenos: |
| 57 | + :caption: **Example:** Query maximum number of work-items for a device |
| 58 | + :name: ex_max_work_item |
| 59 | +
|
| 60 | + import dpctl |
| 61 | + import math |
| 62 | +
|
| 63 | + d = dpctl.SyclDevice("gpu") |
| 64 | + d.print_device_info() |
| 65 | +
|
| 66 | + max_num_work_items = ( |
| 67 | + d.max_work_group_size |
| 68 | + * d.max_work_item_sizes1d[0] |
| 69 | + * d.max_work_item_sizes2d[0] |
| 70 | + * d.max_work_item_sizes3d[0] |
| 71 | + ) |
| 72 | + print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})") |
| 73 | +
|
| 74 | + cpud = dpctl.SyclDevice("cpu") |
| 75 | + cpud.print_device_info() |
| 76 | +
|
| 77 | + max_num_work_items_cpu = ( |
| 78 | + cpud.max_work_group_size |
| 79 | + * cpud.max_work_item_sizes1d[0] |
| 80 | + * cpud.max_work_item_sizes2d[0] |
| 81 | + * cpud.max_work_item_sizes3d[0] |
| 82 | + ) |
| 83 | + print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})") |
| 84 | +
|
| 85 | + The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated |
| 86 | + graphics processor and a 9th Generation Coffee Lake CPU is shown in |
| 87 | + :ref:`ex_max_work_item_output`. |
| 88 | +
|
| 89 | + .. code-block:: bash |
| 90 | + :caption: **OUTPUT:** Query maximum number of work-items for a device |
| 91 | + :name: ex_max_work_item_output |
| 92 | +
|
| 93 | + Name Intel(R) UHD Graphics 630 [0x3e98] |
| 94 | + Driver version 1.3.24595 |
| 95 | + Vendor Intel(R) Corporation |
| 96 | + Filter string level_zero:gpu:0 |
| 97 | +
|
| 98 | + 4294967296 (2^32) |
| 99 | + Name Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz |
| 100 | + Driver version 2023.16.12.0.12_195853.xmain-hotfix |
| 101 | + Vendor Intel(R) Corporation |
| 102 | + Filter string opencl:cpu:0 |
| 103 | +
|
| 104 | + 4503599627370496 (2^52) |
| 105 | +
|
| 106 | +
|
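The global range limit described in the note above can be checked before a launch. The following pure-Python sketch is not part of numba-dpex: the helper ``global_range_fits`` and the constant ``MAX_GLOBAL_RANGE_SIZE`` are hypothetical names, and the sketch assumes that the ``2^31-1`` limit applies to the total number of work-items, that is, to the product of all dimensions.

```python
import math

# Limit quoted in the note above. The constant name is hypothetical.
MAX_GLOBAL_RANGE_SIZE = 2**31 - 1


def global_range_fits(global_range):
    """Return True if the total work-item count is within the limit.

    Hypothetical helper, not a numba-dpex API.
    """
    dims = tuple(int(d) for d in global_range)
    if not 1 <= len(dims) <= 3:
        raise ValueError("a range has one, two, or three dimensions")
    if any(d <= 0 for d in dims):
        raise ValueError("every dimension must be positive")
    # Assumption: the limit applies to the product of all dimensions.
    return math.prod(dims) <= MAX_GLOBAL_RANGE_SIZE


print(global_range_fits((1024, 1024)))    # 2^20 work-items in total
print(global_range_fits((65536, 65536)))  # 2^32 work-items in total
```

Such a check lets a program fail early with a clear message instead of relying on the ``SyclKernelSubmitError`` raised at submission time.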
| 107 | +The ``call_kernel`` function can be invoked both from CPython and from another |
| 108 | +Numba* compiled function. Note that the ``call_kernel`` function supports only |
| 109 | +synchronous kernel execution; the ``call_kernel_async`` function should be |
| 110 | +used to execute a kernel asynchronously (refer to |
| 111 | +:ref:`launching-an-async-kernel`). |
| 112 | +
|
| 113 | +
|
| 114 | +.. seealso:: |
| 115 | +
|
| 116 | + Refer to the API documentation for |
| 117 | + :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details. |