Improvement to performance of tensor.sum #1303
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1303/index.html
Array API standard conformance tests for dpctl=0.14.6dev0=py310h7bf5fec_28 ran successfully.
Currently giving incorrect results for some cases with axes:
This improves performance roughly 8-fold:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 2.64 ms, sys: 4.4 ms, total: 7.04 ms
Wall time: 10 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.93 ms, sys: 3.22 ms, total: 5.16 ms
Wall time: 4.74 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.7 ms, sys: 2.83 ms, total: 4.53 ms
Wall time: 4.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.98 ms, sys: 3.3 ms, total: 5.28 ms
Wall time: 4.7 ms
```

The timing before was around 38 ms.
1d5228a to 5c4f980
Array API standard conformance tests for dpctl=0.14.6dev0=py310h7bf5fec_53 ran successfully.
- Adjusted to reduce branching and hopefully improve vectorization of the loop by removing a conditional
Array API standard conformance tests for dpctl=0.14.6dev0=py310h7bf5fec_62 ran successfully.
Array API standard conformance tests for dpctl=0.14.6dev2=py310h7bf5fec_7 ran successfully.
Array API standard conformance tests for dpctl=0.14.6dev2=py310h7bf5fec_10 ran successfully.
The timing now:
It is still an improvement over 38 ms, but my claim of a reduction to 4.7 ms was erroneous (it was faster only because the result was incorrect). @ndgrigorian I would prefer to merge this change and defer further work to later PRs.
I agree. I'll approve this PR.
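Since the 4.7 ms figure came from a kernel that returned wrong values, it may help to validate the result inside the timing loop. The following is a minimal sketch, not code from this PR: `time_checked` and `column_sum` are hypothetical names, and a pure-Python column sum stands in for `dpt.sum(x, axis=0)` so the snippet runs without a device.

```python
import time


def time_checked(fn, is_correct, repeats=5):
    """Time fn() several times, rejecting any run whose result is wrong."""
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        elapsed_ms = (time.perf_counter() - start) * 1e3
        # A fast but incorrect result must not be recorded as a speed-up.
        if not is_correct(result):
            raise AssertionError("result failed validation; timing is meaningless")
        timings_ms.append(elapsed_ms)
    return timings_ms


# Hypothetical stand-in for the benchmark input: an array of ones.
rows, cols = 64, 32
x = [[1.0] * cols for _ in range(rows)]


def column_sum():
    # Stand-in for dpt.sum(x, axis=0): sum each column of x.
    return [sum(col) for col in zip(*x)]


# Summing ones along axis 0 must give `rows` in every output element.
timings = time_checked(column_sum, lambda y: y == [float(rows)] * cols)
print(f"best of {len(timings)} runs: {min(timings):.3f} ms")
```

With a real device array, the check would compare against a trusted reference on the host. If the library returns before the device has finished, the timed region would also need a synchronization step; whether that applies here is not stated in this thread.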
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.14.6dev3=py310ha25a700_42 ran successfully.
Transition sum-reduction from `sycl::nd_range<2>` to `sycl::nd_range<1>`
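One way to picture the move from a two-dimensional to a one-dimensional launch is that a single flat work-group id is split into "which output element" and "which slice of the reduction". The sketch below simulates that decomposition in plain Python. It is an illustration under assumptions, not the kernel from this PR: `sum_axis0_flat`, `wg_size`, and the exact index split are hypothetical.

```python
def sum_axis0_flat(x, wg_size=4):
    """Sum a 2-D list over axis 0 using one flat range of work-group ids."""
    n_rows, n_cols = len(x), len(x[0])
    # Work-groups needed to cover one reduction (ceiling division).
    groups_per_reduction = (n_rows + wg_size - 1) // wg_size
    out = [0.0] * n_cols
    # A single loop over group ids stands in for a 1-D launch.
    for group_id in range(n_cols * groups_per_reduction):
        col = group_id // groups_per_reduction   # output element to update
        chunk = group_id % groups_per_reduction  # slice of the reduction axis
        partial = 0.0
        for local_id in range(wg_size):
            row = chunk * wg_size + local_id
            if row < n_rows:  # skip padding in the last chunk
                partial += x[row][col]
        # Stands in for combining per-group partial sums on the device.
        out[col] += partial
    return out


x = [[float(3 * r + c) for c in range(3)] for r in range(10)]
print(sum_axis0_flat(x))
```

In a two-dimensional launch the pair `(col, chunk)` would come directly from the two range dimensions; here both are recovered from one integer by division and remainder.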