========
Features
========

DPPL is currently implemented using OpenCL 2.1. The features currently available
are listed below with the help of sample code snippets. This release contains the
implementation of the OAK approach described in MS138, section 4.3.2. The new
decorator is described below.

To access these features, the driver module has to be imported from numba.dppl.dppl_driver.

New Decorator
=============

The new decorator included in this release is *dppl.kernel*. Currently this decorator
takes only one option, *access_types*, which is explained below with the help of an example.
Users can write OpenCL-style kernels in which they can identify the global id of the work
item being executed. The methods supported inside a decorated function are listed below,
followed by a short sketch:

- dppl.get_global_id(dimidx)
- dppl.get_local_id(dimidx)
- dppl.get_group_num(dimidx)
- dppl.get_num_groups(dimidx)
- dppl.get_work_dim()
- dppl.get_global_size(dimidx)
- dppl.get_local_size(dimidx)

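As a minimal sketch of these indexing intrinsics (the array names and launch configuration
are illustrative; the launch syntax is shown in the Examples section below), a kernel can
record each work item's global and group-local index:

.. code-block:: python

    @dppl.kernel
    def identify_work_items(global_ids, local_ids):
        # Index of this work item across the whole global range (dimension 0).
        i = dppl.get_global_id(0)
        global_ids[i] = i
        # Index of this work item within its own work group (dimension 0).
        local_ids[i] = dppl.get_local_id(0)
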
Currently no support is provided for local memory on the device and everything resides
in global memory. Barriers and other memory fences will be provided once support for
local memory is completed.


Device Environment
==================

To invoke a kernel, a device environment is required. The device environment can be
initialized by the following methods; a short sketch follows the list:

- driver.runtime.get_gpu_device()
- driver.runtime.get_cpu_device()

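A minimal sketch of selecting a device, preferring the GPU and falling back to the CPU;
it assumes the driver module described above has been imported as *driver* (the Examples
section below imports it as *ocldrv*):

.. code-block:: python

    # Prefer a GPU device environment; fall back to the CPU device if none is available.
    try:
        device_env = driver.runtime.get_gpu_device()
    except:
        device_env = driver.runtime.get_cpu_device()
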

Device Array
============

Device arrays are used for representing memory buffers on the device. Device Array
supports only ndarrays in this release. Convenience methods are provided to allocate
a memory buffer representing an ndarray on the device; a short sketch follows the list:

- device_env.copy_array_to_device(ndarray) : Allocate a buffer of the size of the ndarray
                                             and copy the data from host to device.

- driver.DeviceArray(device_env.get_env_ptr(), ndarray) : Allocate a buffer of the size of
                                                          the ndarray.

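A short sketch of both allocation paths, assuming a device environment *device_env* obtained
as in the previous section and the driver module imported as *driver*:

.. code-block:: python

    import numpy as np

    a = np.arange(10, dtype=np.float32)

    # Allocate a device buffer and copy the host data into it.
    dA = device_env.copy_array_to_device(a)

    # Allocate a device buffer of the same size as `a` without copying the host data.
    dB = driver.DeviceArray(device_env.get_env_ptr(), a)
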

Primitive types are passed by value to the kernel; currently supported types are int, float
and double.

Math Kernels
============

This release has support for math kernels. See numba/dppl/tests/dppl/test_math_functions.py
for more details.
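As an illustrative sketch only, assuming that functions from Python's math module can be
called inside a *dppl.kernel* (the referenced test file defines what is actually supported):

.. code-block:: python

    import math

    @dppl.kernel
    def apply_sqrt(a, b):
        # Apply a scalar math function element-wise, one work item per element.
        i = dppl.get_global_id(0)
        b[i] = math.sqrt(a[i])
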


========
Examples
========

Sum of two 1d arrays
====================

The full example can be found at numba/dppl/examples/sum.py.

To write a program that sums two 1d arrays, we first need an OpenCL device environment.
We can get the environment by using *ocldrv.runtime.get_gpu_device()* for the GPU
environment or *ocldrv.runtime.get_cpu_device()* for the CPU environment. We then need
to copy the data (which has to be an ndarray) to the device (CPU or GPU) through OpenCL:
*device_env.copy_array_to_device(data)* will read the ndarray and copy it to the device,
while *ocldrv.DeviceArray(device_env.get_env_ptr(), data)* will create a buffer on the
device with the same memory size as the ndarray being passed. The OpenCL kernel in the
following example is *data_parallel_sum*. To get the id of the work item currently being
executed we use *dppl.get_global_id(0)*; since this example has only 1 dimension, we only
need the id in dimension 0.

While invoking the kernel we need to pass the device environment and the global work size.
After the kernel has executed, we copy the data containing the sum of the two 1d arrays
back to the host with *device_env.copy_array_from_device(dC)*.

.. code-block:: python

    # Assumes np (NumPy), dppl and ocldrv (the driver module from
    # numba.dppl.dppl_driver) have been imported as in numba/dppl/examples/sum.py.

    @dppl.kernel
    def data_parallel_sum(a, b, c):
        i = dppl.get_global_id(0)
        c[i] = a[i] + b[i]

    global_size = 10
    N = global_size

    a = np.array(np.random.random(N), dtype=np.float32)
    b = np.array(np.random.random(N), dtype=np.float32)
    c = np.ones_like(a)

    # Select a device for executing the kernel
    device_env = None
    try:
        device_env = ocldrv.runtime.get_gpu_device()
    except:
        try:
            device_env = ocldrv.runtime.get_cpu_device()
        except:
            raise SystemExit()

    # Copy the data to the device
    dA = device_env.copy_array_to_device(a)
    dB = device_env.copy_array_to_device(b)
    dC = ocldrv.DeviceArray(device_env.get_env_ptr(), c)

    # Launch the kernel with the device environment and the global work size,
    # then copy the result back into the host array c.
    data_parallel_sum[device_env, global_size](dA, dB, dC)
    device_env.copy_array_from_device(dC)

ndArray Support
===============

Passing ndarrays directly to kernels is also supported.

The full example can be found at numba/dppl/examples/sum_ndarray.py.

To use this feature, instead of creating device buffers explicitly as in the previous
example, users can pass the ndarray directly to the kernel. Internally this results in
copying the existing data in the ndarray to the device and copying it back after the
kernel has finished executing.

In the previous example some redundant work is done: the buffer that will hold the result
of the summation on the device does not need to be copied from the host, and the input
data being added does not need to be copied back to the host after the kernel has executed.
To avoid this redundant work, users can provide hints to the compiler through the
*access_types* option of the kernel decorator. Currently there are three access types:
*read_only*, meaning data will only be copied from host to device; *write_only*, meaning
memory will be allocated on the device and copied back to the host; and *read_write*,
which copies data both to and from the device.

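A sketch of what such a hint might look like; the exact format of the *access_types*
argument is defined by the full example at numba/dppl/examples/sum_ndarray.py, and the
dictionary form below (mapping each access type to a list of kernel parameter names) is
an assumption:

.. code-block:: python

    # Assumed form: each access type maps to the names of the kernel parameters it covers.
    @dppl.kernel(access_types={"read_only": ["a", "b"],
                               "write_only": ["c"],
                               "read_write": []})
    def data_parallel_sum(a, b, c):
        i = dppl.get_global_id(0)
        c[i] = a[i] + b[i]

    # ndarrays are passed directly; copies to and from the device follow the hints.
    data_parallel_sum[device_env, global_size](a, b, c)
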

Reduction
=========

This example demonstrates a sum reduction of a 1d array.

The full example can be found at numba/dppl/examples/sum_reduction.py.

In this example, to sum the 1d array we invoke the kernel multiple times.
This could be implemented with a single kernel invocation, but that requires
support for local device memory and barriers, which is a work in progress.

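A rough sketch of the multi-pass approach, assuming the array length is a power of two and
a device environment *device_env* has been obtained as shown earlier (the actual
implementation is in numba/dppl/examples/sum_reduction.py):

.. code-block:: python

    import numpy as np

    @dppl.kernel
    def partial_sum(A, stride):
        # Each work item folds the element `stride` positions away into its own slot.
        i = dppl.get_global_id(0)
        A[i] = A[i] + A[i + stride]

    N = 1024  # power of two for this sketch
    a = np.random.random(N).astype(np.float32)
    dA = device_env.copy_array_to_device(a)

    # Halve the number of active work items on every pass until one element remains.
    remaining = N
    while remaining > 1:
        stride = remaining // 2
        partial_sum[device_env, stride](dA, stride)
        remaining = stride

    # After the final pass, a[0] holds the sum of the original array.
    device_env.copy_array_from_device(dA)
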

==============
ParFor Support
==============

*Parallel For* is supported in this release for up to 3 dimensions.

Full examples can be found in numba/dppl/examples/pa_examples/.
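As a sketch of the general shape, using Numba's standard *prange* loop; the dppl-specific
offload configuration is defined by the examples under pa_examples/ and is assumed rather
than reproduced here:

.. code-block:: python

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def vector_add(a, b, c):
        # Each iteration of the prange loop is a candidate for parallel execution.
        for i in prange(a.shape[0]):
            c[i] = a[i] + b[i]
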


=======
Testing
=======

All examples can be found in numba/dppl/examples/.

All tests can be found in numba/dppl/tests/dppl and can be triggered by the following command:

``python -m numba.runtests numba.dppl.tests``
