========
Features
========

DPPL is currently implemented using OpenCL 2.1. The features currently available
are listed below with the help of sample code snippets. This release includes an
implementation of the OAK approach described in section 4.3.2 of MS138. The new
decorator is described below.

To access these features, the driver module has to be imported from numba.dppl.dppl_driver.

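One possible set of imports is sketched below; it assumes the decorator and index
queries live in the *numba.dppl* module and that the driver module is aliased to
*ocldrv*, as in the examples later in this document. The aliases are a convention of
this document, not a requirement.

.. code-block:: python

    import numpy as np                        # used by the examples below
    import numba.dppl as dppl                 # assumed home of dppl.kernel and the index queries
    import numba.dppl.dppl_driver as ocldrv   # assumed import path for the driver module
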
New Decorator
=============

The new decorator included in this release is *dppl.kernel*. Currently this decorator
takes only one option, *access_types*, which is explained below with the help of an example.
Users can write OpenCL-style kernels in which they can identify the global id of the work
item being executed. The supported methods inside a decorated function are:
| 19 | + |
| 20 | +- dppl.get_global_id(dimidx) |
| 21 | +- dppl.get_local_id(dimidx) |
| 22 | +- dppl.get_group_num(dimidx) |
| 23 | +- dppl.get_num_groups(dimidx) |
| 24 | +- dppl.get_work_dim() |
| 25 | +- dppl.get_global_size(dimidx) |
| 26 | +- dppl.get_local_size(dimidx) |

Currently no support is provided for local memory on the device; everything resides in
global memory. Barriers and other memory fences will be provided once support for local
memory is completed.


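As a quick illustration of the index queries above, the following is a minimal sketch of
a two-dimensional kernel. The kernel name and arguments are hypothetical, and the launch
configuration for a multi-dimensional range is not shown here.

.. code-block:: python

    import numba.dppl as dppl  # assumed alias, see the import note above

    @dppl.kernel
    def copy_2d(src, dst):
        # Work-item coordinates in dimensions 0 and 1.
        row = dppl.get_global_id(0)
        col = dppl.get_global_id(1)
        # Total number of work items launched in dimension 1.
        width = dppl.get_global_size(1)
        # Treat the 1d buffers as row-major 2d arrays.
        dst[row * width + col] = src[row * width + col]
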
Device Environment
==================

To invoke a kernel, a device environment is required. The device environment can be
initialized by the following methods:

- driver.runtime.get_gpu_device()
- driver.runtime.get_cpu_device()


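For example, assuming the driver module is aliased to *ocldrv* as in the examples below,
a device environment can be obtained as follows; a fuller version that falls back from
GPU to CPU appears in the Examples section.

.. code-block:: python

    import numba.dppl.dppl_driver as ocldrv  # assumed import path and alias

    device_env = ocldrv.runtime.get_gpu_device()  # or ocldrv.runtime.get_cpu_device()
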
Device Array
============

Device arrays are used to represent memory buffers on the device. Device Array supports
only ndarrays in this release. Convenience methods are provided to allocate a memory
buffer representing an ndarray on the device. They are:

- device_env.copy_array_to_device(ndarray) : Allocate a buffer of the size of the
  ndarray and copy the data from host to device.

- driver.DeviceArray(device_env.get_env_ptr(), ndarray) : Allocate a buffer of the
  size of the ndarray.


Primitive types are passed by value to the kernel; currently supported types are int,
float and double.


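The sketch below shows both allocation methods together with a primitive argument passed
by value. The kernel name *scale* and the scalar *factor* are hypothetical.

.. code-block:: python

    import numpy as np
    import numba.dppl as dppl                 # assumed alias
    import numba.dppl.dppl_driver as ocldrv   # assumed alias

    @dppl.kernel
    def scale(a, out, factor):
        # `factor` is a primitive (here a float) passed by value.
        i = dppl.get_global_id(0)
        out[i] = a[i] * factor

    N = 8
    a = np.arange(N, dtype=np.float32)
    out = np.empty_like(a)

    device_env = ocldrv.runtime.get_cpu_device()  # or get_gpu_device()

    # Input: allocate a buffer and copy the host data to the device.
    dA = device_env.copy_array_to_device(a)
    # Output: allocate a buffer of the same size as `out`, no copy needed.
    dOut = ocldrv.DeviceArray(device_env.get_env_ptr(), out)

    scale[device_env, N](dA, dOut, 2.0)
    device_env.copy_array_from_device(dOut)
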
Math Kernels
============

This release has support for math kernels. See numba/dppl/tests/dppl/test_math_functions.py
for more details.

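Below is a sketch of a kernel using a math function, assuming *math.sqrt* is among the
supported math kernels; the actual list of supported functions is in the test file
mentioned above.

.. code-block:: python

    import math
    import numpy as np
    import numba.dppl as dppl                 # assumed alias
    import numba.dppl.dppl_driver as ocldrv   # assumed alias

    @dppl.kernel
    def vector_sqrt(a, b):
        i = dppl.get_global_id(0)
        b[i] = math.sqrt(a[i])  # assumes math.sqrt is a supported math kernel

    N = 16
    a = np.arange(N, dtype=np.float32)
    b = np.empty_like(a)

    device_env = ocldrv.runtime.get_cpu_device()  # or get_gpu_device()
    dA = device_env.copy_array_to_device(a)
    dB = ocldrv.DeviceArray(device_env.get_env_ptr(), b)

    vector_sqrt[device_env, N](dA, dB)
    device_env.copy_array_from_device(dB)
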
========
Examples
========

Sum of two 1d arrays
====================

The full example can be found at numba/dppl/examples/sum.py.

To write a program that sums two 1d arrays we first need an OpenCL device environment.
We can get the environment by using *ocldrv.runtime.get_gpu_device()* for the GPU
environment or *ocldrv.runtime.get_cpu_device()* for the CPU environment. We then need
to copy the data (which has to be an ndarray) to the device (CPU or GPU) through OpenCL:
*device_env.copy_array_to_device(data)* will read the ndarray and copy it to the device,
and *ocldrv.DeviceArray(device_env.get_env_ptr(), data)* will create a buffer on the
device that has the same memory size as the ndarray being passed. The OpenCL kernel in
the following example is *data_parallel_sum*. To get the id of the work item we are
currently executing we use *dppl.get_global_id(0)*; since this example has only one
dimension, we only need the id in dimension 0.

When invoking the kernel we need to pass the device environment and the global work size.
After the kernel has executed, to get the data that contains the sum of the two 1d arrays
back to the host we can use *device_env.copy_array_from_device(dC)*.

.. code-block:: python

    @dppl.kernel
    def data_parallel_sum(a, b, c):
        i = dppl.get_global_id(0)
        c[i] = a[i] + b[i]

    global_size = 10
    N = global_size

    a = np.array(np.random.random(N), dtype=np.float32)
    b = np.array(np.random.random(N), dtype=np.float32)
    c = np.ones_like(a)

    # Select a device for executing the kernel
    device_env = None
    try:
        device_env = ocldrv.runtime.get_gpu_device()
    except:
        try:
            device_env = ocldrv.runtime.get_cpu_device()
        except:
            raise SystemExit()

    # Copy the data to the device
    dA = device_env.copy_array_to_device(a)
    dB = device_env.copy_array_to_device(b)
    dC = ocldrv.DeviceArray(device_env.get_env_ptr(), c)

    # Launch the kernel, then copy the result back to the host
    data_parallel_sum[device_env, global_size](dA, dB, dC)
    device_env.copy_array_from_device(dC)

ndArray Support
===============

Passing ndarrays directly to kernels is also supported.

The full example can be found at numba/dppl/examples/sum_ndarray.py.

To avail this feature, instead of creating device buffers explicitly as in the previous
example, users can directly pass the ndarray to the kernel. Internally, this results in
copying the data in the ndarray to the device and copying it back after the kernel has
finished executing.

In the previous example we can see some redundant work being done. The buffer that holds
the result of the summation on the device does not need to be copied from the host, and
the input data being added does not need to be copied back to the host after the kernel
has executed. To avoid this redundant work, users can provide hints to the compiler
through the *access_types* option of the function decorator, as sketched below. Currently,
there are three access types: *read_only*, meaning data will only be copied from host to
device; *write_only*, meaning memory will be allocated on the device and copied back to
the host; and *read_write*, which will copy data both to and from the device.


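The sketch below shows one way the hints might be spelled, assuming *access_types* maps
each access category to the names of the kernel arguments it applies to; the exact form
used in this release is shown in numba/dppl/examples/sum_ndarray.py.

.. code-block:: python

    import numpy as np
    import numba.dppl as dppl                 # assumed alias
    import numba.dppl.dppl_driver as ocldrv   # assumed alias

    # Assumed spelling of the access hints: a mapping from access type to argument names.
    @dppl.kernel(access_types={"read_only": ["a", "b"], "write_only": ["c"]})
    def data_parallel_sum(a, b, c):
        i = dppl.get_global_id(0)
        c[i] = a[i] + b[i]

    N = 10
    a = np.random.random(N).astype(np.float32)
    b = np.random.random(N).astype(np.float32)
    c = np.empty_like(a)

    device_env = ocldrv.runtime.get_gpu_device()  # or get_cpu_device()

    # ndarrays are passed directly; copies to and from the device happen
    # automatically, guided by the access hints above.
    data_parallel_sum[device_env, N](a, b, c)
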
Reduction
=========

This example demonstrates a sum reduction of a 1d array.

The full example can be found at numba/dppl/examples/sum_reduction.py.

In this example, to sum the 1d array we invoke the kernel multiple times.
This could be done by invoking the kernel only once, but that requires support
for local device memory and barriers, which is a work in progress.


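A minimal sketch of the multi-pass approach follows, assuming the array length is a power
of two; each pass folds the upper half of the array into the lower half until a single
element remains. The kernel name and the scalar *half* argument are hypothetical.

.. code-block:: python

    import numpy as np
    import numba.dppl as dppl                 # assumed alias
    import numba.dppl.dppl_driver as ocldrv   # assumed alias

    @dppl.kernel
    def sum_reduction_kernel(A, half):
        # Fold element i + half into element i; `half` is an int passed by value.
        i = dppl.get_global_id(0)
        A[i] = A[i] + A[i + half]

    N = 1024  # assumed to be a power of two for this sketch
    a = np.random.random(N).astype(np.float32)
    expected = a.sum()

    device_env = ocldrv.runtime.get_gpu_device()  # or get_cpu_device()
    dA = device_env.copy_array_to_device(a)

    # Halve the number of active work items on every pass.
    size = N
    while size > 1:
        half = size // 2
        sum_reduction_kernel[device_env, half](dA, half)
        size = half

    device_env.copy_array_from_device(dA)
    print(a[0], expected)  # a[0] now holds the reduced sum
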
==============
ParFor Support
==============

*Parallel For* is supported in this release for up to 3 dimensions.

Full examples can be found in numba/dppl/examples/pa_examples/.


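Below is a sketch of a parallel-for loop written with Numba's *prange*; the exact option
that routes the parfor to the OpenCL device in this release may differ from the plain
``parallel=True`` used here and is shown in the pa_examples directory.

.. code-block:: python

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)  # the device-offload option used by this release may differ
    def add(a, b):
        c = np.empty_like(a)
        for i in prange(a.shape[0]):  # 1d parfor; up to 3 dimensions are supported
            c[i] = a[i] + b[i]
        return c

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)
    print(add(a, b)[:4])
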
=======
Testing
=======

All examples can be found in numba/dppl/examples/.

All tests can be found in numba/dppl/tests/dppl and can be triggered by the following command:

``python -m numba.runtests numba.dppl.tests``