Commit 0e8d4fb

Update examples (#217)
* Updated source codes to compile with oneAPI Gold and latest changes in dpctl
* sycl_direct_linkage: add README.md, fixed run.py
* added sycl_direct_linkage/bench.py, expanded README.md
* dress cython/sycl_direct_linkage/bench.py in black
1 parent 1d38c5c commit 0e8d4fb

File tree

6 files changed: +98 -4 lines changed

examples/cython/sycl_direct_linkage/README.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# Example "sycl_direct_linkage"

This Cython extension does not use dpCtl and links to SYCL directly.

It exposes a `columnwise_total` function that uses oneMKL to compute
totals for each column of its argument matrix in double precision,
expected as an ordinary NumPy array in C-contiguous layout.
This function performs the following steps:

1. Creates a SYCL queue using the default device selector
2. Creates a SYCL buffer around the matrix data
3. Creates a vector `v_ones` with all elements set to one,
   and allocates memory for the result
4. Calls oneMKL to compute xGEMV, as dot(v_ones, M)
5. Returns the result as a NumPy array
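The computation in steps 3 and 4 is an ordinary matrix-vector product; a minimal NumPy sketch of the same arithmetic (the helper name `columnwise_total_ref` is hypothetical, not part of the extension):

```
import numpy as np

def columnwise_total_ref(mat):
    # Step 3: vector of ones, one entry per row of the matrix
    v_ones = np.ones(mat.shape[0], dtype=mat.dtype)
    # Step 4: xGEMV computes dot(v_ones, mat), i.e. the total of each column
    return v_ones @ mat

# Same input as bench.py below: every column sums to 10**4 * 1e-4 = 1.0
X = np.full((10 ** 4, 4098), 1e-4, dtype="d")
print(columnwise_total_ref(X)[:3])
```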
This extension does not allow one to control the device/queue to
which execution of the kernel is scheduled.

A related example, "sycl_buffer", modifies this example to use
`dpCtl` to retrieve the current queue, allowing a user to control
the queue and avoid the overhead of queue creation.
To illustrate the queue creation overhead in each call, compare execution with the default queue,
which is an Intel Gen9 GPU on the OpenCL backend:

```
(idp) [11:24:38 ansatnuc04 sycl_direct_linkage]$ SYCL_BE=PI_OPENCL python bench.py
========== Executing warm-up ==========
NumPy result: [1. 1. 1. ... 1. 1. 1.]
SYCL(default_device) result: [1. 1. 1. ... 1. 1. 1.]
Running time of 100 calls to columnwise_total on matrix with shape (10000, 4098)
Times for default_selector, inclusive of queue creation:
[19.384219504892826, 19.49932464491576, 19.613155928440392, 19.64031868893653, 19.752969074994326]
Times for NumPy
[3.5394036192446947, 3.498957809060812, 3.4925728561356664, 3.5036555202677846, 3.493739523924887]
```
vs. timing when `dpctl`'s current queue is being reused:

```
(idp) [11:29:14 ansatnuc04 sycl_buffer]$ python bench.py
========== Executing warm-up ==========
NumPy result: [1. 1. 1. ... 1. 1. 1.]
SYCL(Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz) result: [1. 1. 1. ... 1. 1. 1.]
SYCL(Intel(R) Graphics Gen9 [0x9bca]) result: [1. 1. 1. ... 1. 1. 1.]
Times for 'opencl:cpu:0'
[2.9164800881408155, 2.8714500251226127, 2.9770236839540303, 2.913622073829174, 2.7949972581118345]
Times for 'opencl:gpu:0'
[9.529508924111724, 10.288004886358976, 10.189113245811313, 10.197128206957132, 10.26169267296791]
Times for NumPy
[3.4809365631081164, 3.42917942116037, 3.42471009073779, 3.3689011191017926, 3.4336009239777923]
```
So the overhead of `sycl::queue` creation per call is roughly comparable with the time to
execute the actual computation.
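That claim can be checked from the first quoted timing of each run, assuming both runs hit the same Gen9 GPU (each value is the total wall-clock time for 100 calls). A small sketch of the arithmetic:

```
# First reported totals (seconds per 100 calls), copied from the output above.
inclusive_gpu = 19.384219504892826  # default_selector run: queue created per call
reused_gpu = 9.529508924111724      # 'opencl:gpu:0' run: dpctl's queue reused

queue_overhead_per_call = (inclusive_gpu - reused_gpu) / 100
compute_per_call = reused_gpu / 100
print(round(queue_overhead_per_call, 3), round(compute_per_call, 3))  # 0.099 0.095
```

About 99 ms of per-call queue creation overhead versus about 95 ms of actual computation, which is what "roughly comparable" refers to.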
examples/cython/sycl_direct_linkage/bench.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
```
import dpctl
import syclbuffer_naive as sb
import numpy as np

X = np.full((10 ** 4, 4098), 1e-4, dtype="d")

# warm-up
print("=" * 10 + " Executing warm-up " + "=" * 10)
print("NumPy result: ", X.sum(axis=0))

print(
    "SYCL(default_device) result: {}".format(
        sb.columnwise_total(X),
    )
)

import timeit

print(
    "Running time of 100 calls to columnwise_total on matrix with shape {}".format(
        X.shape
    )
)

print("Times for default_selector, inclusive of queue creation:")
print(
    timeit.repeat(
        stmt="sb.columnwise_total(X)",
        setup="sb.columnwise_total(X)",  # ensure JIT compilation is not counted
        number=100,
        globals=globals(),
    )
)

print("Times for NumPy")
print(timeit.repeat(stmt="X.sum(axis=0)", number=100, globals=globals()))
```
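For reference, `timeit.repeat` as used in bench.py returns a list of totals, not per-call times: each entry is the wall-clock time for `number` executions of `stmt`. A standalone sketch with a trivial statement:

```
import timeit

# Each entry is the TOTAL time (seconds) for `number` executions of stmt.
totals = timeit.repeat(stmt="sum(range(1000))", number=100, repeat=5)
per_call = [t / 100 for t in totals]  # divide by `number` for per-call time

print(len(totals), all(t >= 0 for t in totals))  # 5 True
```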

examples/cython/sycl_direct_linkage/run.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-import syclbuffer as sb
+import syclbuffer_naive as sb
 import numpy as np

 X = np.random.randn(20, 10)
examples/cython/sycl_direct_linkage/sycl_function.cpp

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 #include <CL/sycl.hpp>
 #include "sycl_function.hpp"
-#include "mkl_blas_sycl.hpp"
+#include <oneapi/mkl.hpp>
 #include "mkl.h"

 int c_columnwise_total(cl::sycl::queue &q, size_t n, size_t m, double *mat, double *ct) {

examples/cython/usm_memory/blackscholes.pyx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # distutils: language=c++

 cimport dpctl as c_dpctl
-cimport dpctl._memory as c_dpctl_mem
+cimport dpctl.memory as c_dpctl_mem
 cimport numpy as cnp
 from cython cimport floating

examples/cython/usm_memory/sycl_blackscholes.cpp

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,8 @@
 #include <CL/sycl.hpp>
+#include <oneapi/mkl.hpp>
+#include <oneapi/mkl/rng/device.hpp>
 #include "dpctl_sycl_types.h"
 #include "sycl_blackscholes.hpp"
-#include "mkl_rng_sycl_device.hpp"

 template<typename T>
 class black_scholes_kernel;

0 commit comments
