-
Notifications
You must be signed in to change notification settings - Fork 30
In-place addition, multiplication, subtraction of usm_ndarrays #1237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1237/index.html |
e9a4f30
to
cd3f169
Compare
Array API standard conformance tests for dpctl=0.14.3dev3=py310h7bf5fec_36 ran successfully. |
Array API standard conformance tests for dpctl=0.14.3dev3=py310h7bf5fec_38 ran successfully. |
Array API standard conformance tests for dpctl=0.14.3dev3=py310h7bf5fec_39 ran successfully. |
Array API standard conformance tests for dpctl=0.14.3dev3=py310h7bf5fec_42 ran successfully. |
dpctl/tensor/libtensor/include/kernels/elementwise_functions/common_inplace.hpp
Outdated
Show resolved
Hide resolved
- This change fixes some failing tests - Added additional tests for in-place addition
- Adjusted tests for in-place addition to improve coverage
… themselves - Now makes a copy and adds the copy to the original array
- functionality such as binop(x, y, out=x) now possible, with some edge cases still WIP
2a98e1d
to
da5f2f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @ndgrigorian
Array API standard conformance tests for dpctl=0.14.3dev3=py310h7bf5fec_50 ran successfully. |
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞 |
Array API standard conformance tests for dpctl=0.14.3dev4=py310h7bf5fec_10 ran successfully. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just small comments. You can ignore it, if you want.
sycl::vec<argT, vec_sz> arg_vec; | ||
sycl::vec<resT, vec_sz> res_vec; | ||
|
||
#pragma unroll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried to compare performance against something simple like this:
int grid = ndit.get_group_linear_id();
int llid = ndit.get_local_linear_id();
int local_size = ndit.get_local_range(0); // is this 1-d task?
int base = grid*local_size*vec_sz*n_vecs;
#pragma unroll
for (int i = 0; i < vec_sz*n_vecs; ++i)
size_t k = base + local_size*i + llid;
if (k < nitems)
op(lhs[k], rhs[k]);
I'm pretty sure compiler can place load and store operations for you. You don't need to put it manually.
You probably need to split in two loops (over n_vecs and over vec_sz)
} | ||
} | ||
else { | ||
for (size_t k = base + sg.get_local_id()[0]; k < nelems_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one doesn't feels right.
Are you iterating over all items inside one subgroup?
You are comparing against nelems_
, not against vec_sz*n_vecs
.
Have you validated this version?
X = dpt.zeros((10, 10), dtype=dtype, sycl_queue=q) | ||
dt_kind = X.dtype.kind | ||
if dt_kind in "ui": | ||
X += int(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You really shouldn't test +=0
and *=1
. Use something, what would modify original data. E.g. +=1
and *=2
Also, I don't see results checking
|
||
sz = 127 | ||
ar1 = dpt.ones(sz, dtype=op1_dtype) | ||
ar2 = dpt.ones_like(ar1, dtype=op2_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is much better to use something like arange
or random data if possible. Testing on arrays where all elements are the same could play a bad joke with you.
v = dpt.arange(5, dtype="i4") | ||
|
||
m += v | ||
assert (dpt.asnumpy(m) == np.arange(1, 6, dtype="i4")[np.newaxis, :]).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one good one 👍
but it tests only one dtype.
Closes #1229
This pull request implements in-place addition, subtraction, and multiplication of
usm_ndarray
instances.By avoiding the additional allocation of an output array, the performance advantage compared compared to applying the operator with a scalar, a la
dpt.add(x, 2)
, is considerable.For example: