Simplify copy and cast kernels #1165
Functors should store typed pointers instead of typeless ones. With this change, the call operator of `CastFnT` effectively becomes a trivial call to `convert_impl`. Also added a few data movement optimizations.
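The idea of a functor that stores typed pointers and delegates the cast to a small helper can be illustrated with a minimal host-only sketch. This is not the dpctl implementation: `convert_impl_sketch`, `CastFnSketch`, `StridedCopyCastSketch` and `copy_and_cast_strided` are hypothetical names, the real `convert_impl` may have a different signature, and a plain loop stands in for the SYCL kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the conversion helper named in the PR text.
// Assumption: a plain static_cast is enough for this illustration.
template <typename dstT, typename srcT>
dstT convert_impl_sketch(const srcT &v)
{
    return static_cast<dstT>(v);
}

// Cast functor whose call operator only forwards to the helper.
template <typename srcT, typename dstT>
struct CastFnSketch
{
    dstT operator()(const srcT &v) const
    {
        return convert_impl_sketch<dstT, srcT>(v);
    }
};

// Copy-and-cast functor that keeps typed pointers (const srcT* and dstT*)
// rather than untyped char* pointers that must be reinterpreted per element.
template <typename srcT, typename dstT>
class StridedCopyCastSketch
{
    const srcT *src_;
    dstT *dst_;
    std::ptrdiff_t src_stride_; // strides are in elements, not bytes
    std::ptrdiff_t dst_stride_;

public:
    StridedCopyCastSketch(const srcT *src,
                          std::ptrdiff_t src_stride,
                          dstT *dst,
                          std::ptrdiff_t dst_stride)
        : src_(src), dst_(dst), src_stride_(src_stride),
          dst_stride_(dst_stride)
    {
    }

    // Work done for one logical element index i.
    void operator()(std::ptrdiff_t i) const
    {
        const CastFnSketch<srcT, dstT> fn{};
        dst_[i * dst_stride_] = fn(src_[i * src_stride_]);
    }
};

// Host loop standing in for a kernel launch over n work-items.
template <typename srcT, typename dstT>
void copy_and_cast_strided(const srcT *src,
                           std::ptrdiff_t src_stride,
                           dstT *dst,
                           std::ptrdiff_t dst_stride,
                           std::ptrdiff_t n)
{
    const StridedCopyCastSketch<srcT, dstT> kernel(src, src_stride, dst,
                                                   dst_stride);
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        kernel(i);
    }
}
```

Because the pointer types are part of the functor's template parameters, element offsets are computed in units of elements and no per-element `reinterpret_cast` is needed in the call operator.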
Example where it helps:

```
In [1]: import dpctl, dpctl.tensor as dpt

In [2]: x = dpt.arange(1234*7873, dtype=dpt.int32)

In [3]: xx = dpt.permute_dims(dpt.reshape(x, (2, 617, 7873)), (1,2,0))

In [4]: yy = dpt.permute_dims(dpt.reshape(dpt.empty_like(x, dtype="f4"), (2, 617, 7873)), (1,2,0))

In [5]: %timeit yy[...] = xx
1.07 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

In master the time is about 2.8 ms on Iris Xe.
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1165/index.html
Array API standard conformance tests for dpctl=0.14.3dev0=py310h76be34b_98 ran successfully.
- CopyAndCastContigFactory changed to reflect that contiguous copying and casting is now possible for more than strictly different data types
Array API standard conformance tests for dpctl=0.14.3dev0=py310h76be34b_105 ran successfully.
@ndgrigorian Good catch!
LGTM
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
This PR reworks copy-and-cast functionality by simplifying the underlying functors and removing the specialized implementation for 2D arrays based on `CIndexer_array`, as it is only marginally faster than the general strided implementation based on `CIndexer_vector`.

This PR also adds a special implementation for copy-with-casting of contiguous arrays, making type casting of contiguous arrays several times faster.
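The split between a contiguous fast path and a general strided path can be sketched on the host as follows. This only illustrates the dispatch idea and is not the code in this PR: `is_c_contiguous`, `copy_cast_contig`, `copy_cast_general` and `copy_cast_dispatch` are hypothetical names, strides are assumed to be in elements, and plain loops replace the device kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Dims = std::vector<std::ptrdiff_t>;

// Hypothetical helper: true if strides (in elements) describe a
// row-major (C-contiguous) layout for the given shape.
inline bool is_c_contiguous(const Dims &shape, const Dims &strides)
{
    std::ptrdiff_t expected = 1;
    for (std::size_t k = shape.size(); k-- > 0;) {
        if (shape[k] != 1 && strides[k] != expected) {
            return false;
        }
        expected *= shape[k];
    }
    return true;
}

// Fast path: both arrays are contiguous, so a flat index suffices.
template <typename srcT, typename dstT>
void copy_cast_contig(const srcT *src, dstT *dst, std::ptrdiff_t n)
{
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        dst[i] = static_cast<dstT>(src[i]);
    }
}

// General path: unravel the flat index into a multi-index and apply
// per-dimension strides to find the source and destination offsets.
template <typename srcT, typename dstT>
void copy_cast_general(const srcT *src, const Dims &src_strides,
                       dstT *dst, const Dims &dst_strides,
                       const Dims &shape)
{
    std::ptrdiff_t total = 1;
    for (std::ptrdiff_t s : shape) {
        total *= s;
    }
    for (std::ptrdiff_t flat = 0; flat < total; ++flat) {
        std::ptrdiff_t rem = flat, src_off = 0, dst_off = 0;
        for (std::size_t k = shape.size(); k-- > 0;) {
            const std::ptrdiff_t idx = rem % shape[k];
            rem /= shape[k];
            src_off += idx * src_strides[k];
            dst_off += idx * dst_strides[k];
        }
        dst[dst_off] = static_cast<dstT>(src[src_off]);
    }
}

// Returns true when the contiguous fast path was taken.
template <typename srcT, typename dstT>
bool copy_cast_dispatch(const srcT *src, const Dims &src_strides,
                        dstT *dst, const Dims &dst_strides,
                        const Dims &shape)
{
    if (is_c_contiguous(shape, src_strides) &&
        is_c_contiguous(shape, dst_strides)) {
        std::ptrdiff_t total = 1;
        for (std::ptrdiff_t s : shape) {
            total *= s;
        }
        copy_cast_contig(src, dst, total);
        return true;
    }
    copy_cast_general(src, src_strides, dst, dst_strides, shape);
    return false;
}
```

The fast path skips the per-element index arithmetic entirely, which is one plausible reason a dedicated contiguous implementation can be noticeably faster than the general strided one.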