# Example "sycl_direct_linkage"

This Cython extension does not use dpCtl and links to SYCL directly.

It exposes a `columnwise_total` function that uses oneMKL to compute
the total of each column of its argument matrix in double precision,
expected as an ordinary NumPy array in C-contiguous layout.

This function performs the following steps:

 1. Creates a SYCL queue using the default device selector
 2. Creates a SYCL buffer around the matrix data
 3. Creates a vector `v_ones` with all elements set to one,
    and allocates memory for the result
 4. Calls oneMKL to compute xGEMV, as `dot(v_ones, M)`
 5. Returns the result as a NumPy array

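
The GEMV step above can be illustrated in plain NumPy: multiplying the matrix from
the left by a vector of ones yields the column totals. This is only a sketch of the
math; the extension itself performs it via oneMKL on a SYCL device:

```python
import numpy as np

# dot(v_ones, M) computes the per-column totals of M (same math as xGEMV)
M = np.arange(12, dtype="d").reshape(3, 4)  # C-contiguous double-precision matrix
v_ones = np.ones(M.shape[0], dtype="d")     # vector of ones, one per row of M

totals = v_ones.dot(M)
assert np.array_equal(totals, M.sum(axis=0))
print(totals)  # [12. 15. 18. 21.]
```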
This extension does not allow one to control the device/queue to
which execution of the kernel is scheduled.

A related example, "sycl_buffer", modifies this example to use
`dpCtl` to retrieve the current queue, allowing a user to control the
queue and to avoid the overhead of queue creation.

To illustrate the queue-creation overhead incurred on each call, compare execution with the
default queue, which here is an Intel Gen9 GPU on the OpenCL backend:

```
(idp) [11:24:38 ansatnuc04 sycl_direct_linkage]$ SYCL_BE=PI_OPENCL python bench.py
========== Executing warm-up ==========
NumPy result: [1. 1. 1. ... 1. 1. 1.]
SYCL(default_device) result: [1. 1. 1. ... 1. 1. 1.]
Running time of 100 calls to columnwise_total on matrix with shape (10000, 4098)
Times for default_selector, inclusive of queue creation:
[19.384219504892826, 19.49932464491576, 19.613155928440392, 19.64031868893653, 19.752969074994326]
Times for NumPy
[3.5394036192446947, 3.498957809060812, 3.4925728561356664, 3.5036555202677846, 3.493739523924887]
```

vs. the timing when `dpctl`'s current queue is reused:

```
(idp) [11:29:14 ansatnuc04 sycl_buffer]$ python bench.py
========== Executing warm-up ==========
NumPy result: [1. 1. 1. ... 1. 1. 1.]
SYCL(Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz) result: [1. 1. 1. ... 1. 1. 1.]
SYCL(Intel(R) Graphics Gen9 [0x9bca]) result: [1. 1. 1. ... 1. 1. 1.]
Times for 'opencl:cpu:0'
[2.9164800881408155, 2.8714500251226127, 2.9770236839540303, 2.913622073829174, 2.7949972581118345]
Times for 'opencl:gpu:0'
[9.529508924111724, 10.288004886358976, 10.189113245811313, 10.197128206957132, 10.26169267296791]
Times for NumPy
[3.4809365631081164, 3.42917942116037, 3.42471009073779, 3.3689011191017926, 3.4336009239777923]
```

So the overhead of ``sycl::queue`` creation per call is roughly comparable to the time
needed to execute the actual computation.
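
The NumPy baseline in these timings can be reproduced with a small `timeit` loop. The
sketch below is hypothetical (not the actual `bench.py`): it uses a smaller matrix than
the benchmark's (10000, 4098) so it runs quickly, with `M.sum(axis=0)` standing in as
the NumPy equivalent of `columnwise_total`:

```python
import timeit

import numpy as np

# Smaller stand-in for the (10000, 4098) matrix used in the benchmark
M = np.random.standard_normal((1000, 512))

# Each measurement times 10 calls to the NumPy column-total computation
times = timeit.repeat(stmt="M.sum(axis=0)", globals={"M": M}, number=10, repeat=3)
print("Times for NumPy:", times)
```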