Copying from numpy into usm_ndarray is unnecessarily slow #723

Closed
@oleksandr-pavlyk

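The enclosed script time_copy.py shows that dpctl.tensor.usm_ndarray.__setitem__ is inefficient when copying a C-contiguous host buffer into a C-contiguous USM array. In the output below, the first pair of timings is for the copy_from_host path and the second pair is for dpt.asarray: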

(idp_2021.4) [13:25:40 ansatnuc04 python]$ python time_copy.py
Wall time:  0.00044969748705625534  sec.
Device time:  0.00010292000000000001  sec.
Wall time:  4.959066528826952  sec.
Device time:  0.717467438  sec.

This is likely because copying is done one element per kernel, so the contiguity of both buffers is not taken advantage of.
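The difference between the two strategies can be illustrated on the host with NumPy alone. This is only a minimal sketch of the idea, not dpctl's implementation: the helper names `copy_elementwise`, `copy_bulk` and `copy_into` are hypothetical, and the check simply assumes that a single byte-level copy is valid whenever both arrays are C-contiguous with the same shape and dtype.

```python
import numpy as np


def copy_elementwise(dst, src):
    """Slow path: one assignment per element, ignoring memory layout."""
    for idx in np.ndindex(src.shape):
        dst[idx] = src[idx]


def copy_bulk(dst, src):
    """Fast path: a single byte-level copy over the flattened buffers."""
    dst_bytes = dst.reshape(-1).view("u1")
    src_bytes = src.reshape(-1).view("u1")
    dst_bytes[:] = src_bytes


def copy_into(dst, src):
    """Hypothetical dispatcher: use the bulk path whenever the layouts allow it."""
    same_layout = (
        dst.shape == src.shape
        and dst.dtype == src.dtype
        and dst.flags.c_contiguous
        and src.flags.c_contiguous
    )
    if same_layout:
        copy_bulk(dst, src)
        return "bulk"
    copy_elementwise(dst, src)
    return "elementwise"


# Contiguous source and destination: eligible for the bulk path.
src = np.random.random(size=8 * 1024)
dst = np.empty_like(src)
path_contig = copy_into(dst, src)

# Strided source: falls back to the element-wise path.
strided_src = np.random.random(size=16)[::2]
strided_dst = np.empty(8)
path_strided = copy_into(strided_dst, strided_src)
```

On a device, the bulk branch would correspond to a single memcpy-style transfer such as the copy_from_host call used in the script, while the fallback corresponds to a general strided copy.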

(idp_2021.4) [13:27:17 ansatnuc04 python]$ python -c "import dpctl; print(dpctl.__version__)"
0.12.0dev1+91.gb7a15ed9

time_copy.py script:
# time_copy.py
import numpy as np

import dpctl
import dpctl.tensor as dpt
import dpctl.memory as dpm

n = 8 * 1024
host_array = np.random.random(size=n)

q = dpctl.SyclQueue("gpu", property="enable_profiling")

timer0 = dpctl.SyclTimer(time_scale=1) # report duration in seconds
with timer0(q):
    # bulk copy of the raw bytes into USM memory via copy_from_host
    usm_array = dpt.empty(host_array.shape,
                          dtype=host_array.dtype,
                          sycl_queue=q)
    usm_array.usm_data.copy_from_host(host_array.reshape((-1)).view("u1"))

host_time, device_time = timer0.dt

print("Wall time: ", host_time, " sec.")
print("Device time: ", device_time, " sec.")

timer1 = dpctl.SyclTimer(time_scale=1) # report duration in seconds
with timer1(q):
    # copying using dpt.asarray
    usm_array = dpt.asarray(host_array, sycl_queue=q)

host_time, device_time = timer1.dt

print("Wall time: ", host_time, " sec.")
print("Device time: ", device_time, " sec.")
