Closed
Description
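Using the enclosed script time_copy.py, it is clear that dpctl.tensor.usm_ndarray.__setitem__ is not efficient when copying a C-contiguous host buffer into a C-contiguous USM array. In the output below, the first pair of timings is for usm_data.copy_from_host and the second pair is for dpt.asarray: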
(idp_2021.4) [13:25:40 ansatnuc04 python]$ python time_copy.py
Wall time: 0.00044969748705625534 sec.
Device time: 0.00010292000000000001 sec.
Wall time: 4.959066528826952 sec.
Device time: 0.717467438 sec.
This is likely because the copy is performed one element per kernel, so the contiguity of the source and destination is not taken advantage of.
(idp_2021.4) [13:27:17 ansatnuc04 python]$ python -c "import dpctl; print(dpctl.__version__)"
0.12.0dev1+91.gb7a15ed9
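The suspected cause can be illustrated on the host alone. The following is a minimal sketch using only the Python standard library, not dpctl's actual implementation: the helper names copy_elementwise and copy_contiguous are hypothetical. It shows the idea of checking that both buffers are C-contiguous and, if so, copying the raw bytes in one operation, much as the script does by viewing the host array as "u1" before calling copy_from_host.

```python
from array import array


def copy_elementwise(src, dst):
    """Copy one element at a time (hypothetical helper, the slow generic path)."""
    for i in range(len(src)):
        dst[i] = src[i]


def copy_contiguous(src, dst):
    """Copy in a single bulk operation when both buffers are C-contiguous.

    Hypothetical helper: falls back to the element-wise path otherwise.
    """
    src_mv, dst_mv = memoryview(src), memoryview(dst)
    same_layout = (
        src_mv.c_contiguous
        and dst_mv.c_contiguous
        and src_mv.format == dst_mv.format
        and src_mv.nbytes == dst_mv.nbytes
    )
    if same_layout:
        # Reinterpret both buffers as raw bytes and copy them in one go.
        dst_mv.cast("B")[:] = src_mv.cast("B")
    else:
        copy_elementwise(src, dst)


n = 8 * 1024
host = array("d", (i * 0.5 for i in range(n)))  # stand-in for the host buffer
target = array("d", [0.0]) * n                  # stand-in for the destination
copy_contiguous(host, target)
```

Both paths produce the same result; the difference is that the contiguous path issues one copy for the whole buffer instead of one operation per element.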
time_copy.py script
# time_copy.py
import numpy as np
import dpctl
import dpctl.tensor as dpt
import dpctl.memory as dpm
n = 8 * 1024
host_array = np.random.random(size=n)
q = dpctl.SyclQueue("gpu", property="enable_profiling")
timer0 = dpctl.SyclTimer(time_scale=1) # report duration in seconds
with timer0(q):
    # copying using queue
    usm_array = dpt.empty(host_array.shape,
                          dtype=host_array.dtype,
                          sycl_queue=q)
    usm_array.usm_data.copy_from_host(host_array.reshape((-1)).view("u1"))
host_time, device_time = timer0.dt
print("Wall time: ", host_time, " sec.")
print("Device time: ", device_time, " sec.")
timer1 = dpctl.SyclTimer(time_scale=1) # report duration in seconds
with timer1(q):
    # copying using dpt.asarray
    usm_array = dpt.asarray(host_array, sycl_queue=q)
host_time, device_time = timer1.dt
print("Wall time: ", host_time, " sec.")
print("Device time: ", device_time, " sec.")