Skip to content

Improvement to performance of tensor.sum #1303

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 18, 2023
Merged

Conversation

oleksandr-pavlyk
Copy link
Contributor

Transition sum-reduction from sycl::nd_range<2> to sycl::nd_range<1>

This improves performance 8x-fold:

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 2.64 ms, sys: 4.4 ms, total: 7.04 ms
Wall time: 10 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.93 ms, sys: 3.22 ms, total: 5.16 ms
Wall time: 4.74 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.7 ms, sys: 2.83 ms, total: 4.53 ms
Wall time: 4.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.98 ms, sys: 3.3 ms, total: 5.28 ms
Wall time: 4.7 ms

The timing before was around 38ms

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

@github-actions
Copy link

@coveralls
Copy link
Collaborator

coveralls commented Jul 25, 2023

Coverage Status

coverage: 84.95%. remained the same when pulling 9f54428 on reduction-changes into 074ec3a on master.

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.6dev0=py310h7bf5fec_28 ran successfully.
Passed: 477
Failed: 523
Skipped: 119

@ndgrigorian
Copy link
Collaborator

ndgrigorian commented Jul 26, 2023

Currently giving incorrect results for some cases with axes:

In [1]: import dpctl.tensor as dpt, numpy as np

In [2]: x = dpt.ones((2, 4098, 4098), dtype="i2")

In [3]: x_np = dpt.asnumpy(x)

In [4]: dpt.sum(x, axis=(0, 1))
Out[4]: usm_ndarray([8704, 8704, 8704, ...,  68,   68,   68])

In [5]: np.sum(x_np, axis=(0, 1))
Out[5]: array([8196, 8196, 8196, ..., 8196, 8196, 8196])

This improves performance 8x-fold:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 2.64 ms, sys: 4.4 ms, total: 7.04 ms
Wall time: 10 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.93 ms, sys: 3.22 ms, total: 5.16 ms
Wall time: 4.74 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.7 ms, sys: 2.83 ms, total: 4.53 ms
Wall time: 4.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.98 ms, sys: 3.3 ms, total: 5.28 ms
Wall time: 4.7 ms
```

The timing before was around 38ms
@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.6dev0=py310h7bf5fec_53 ran successfully.
Passed: 567
Failed: 433
Skipped: 119

- Adjusted to reduce branching and hopefully improve vectorization of the loop by removing a conditional
@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.6dev0=py310h7bf5fec_62 ran successfully.
Passed: 594
Failed: 406
Skipped: 119

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.6dev2=py310h7bf5fec_7 ran successfully.
Passed: 913
Failed: 87
Skipped: 119

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.6dev2=py310h7bf5fec_10 ran successfully.
Passed: 913
Failed: 87
Skipped: 119

@oleksandr-pavlyk
Copy link
Contributor Author

The timing now:

(dev_dpctl) opavlyk@opavlyk-mobl:~/repos/dpctl$ ipython
Python 3.9.12 (main, Jun  1 2022, 11:38:51)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 25.2 ms, sys: 6.65 ms, total: 31.9 ms
Wall time: 33.8 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.6 ms, sys: 20.1 ms, total: 30.7 ms
Wall time: 30 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 13.4 ms, sys: 19 ms, total: 32.4 ms
Wall time: 31.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.4 ms, sys: 20.1 ms, total: 30.4 ms
Wall time: 29.5 ms

In [8]: !git describe --tag
0.14.6dev2-7-g9f544288f

It still has improved over 38 ms, but my claim of reducing to 4.7ms was erroneous (it was faster, due to it being incorrect).

@ndgrigorian I would prefer to merge this change, and defer further work to later PRs.

@ndgrigorian
Copy link
Collaborator

The timing now:

(dev_dpctl) opavlyk@opavlyk-mobl:~/repos/dpctl$ ipython
Python 3.9.12 (main, Jun  1 2022, 11:38:51)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 25.2 ms, sys: 6.65 ms, total: 31.9 ms
Wall time: 33.8 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.6 ms, sys: 20.1 ms, total: 30.7 ms
Wall time: 30 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 13.4 ms, sys: 19 ms, total: 32.4 ms
Wall time: 31.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.4 ms, sys: 20.1 ms, total: 30.4 ms
Wall time: 29.5 ms

In [8]: !git describe --tag
0.14.6dev2-7-g9f544288f

It still has improved over 38 ms, but my claim of reducing to 4.7ms was erroneous (it was faster, due to it being incorrect).

@ndgrigorian I would prefer to merge this change, and defer further work to later PRs.

I agree. I'll approve this PR.

@oleksandr-pavlyk oleksandr-pavlyk merged commit b008b8b into master Aug 18, 2023
@oleksandr-pavlyk oleksandr-pavlyk deleted the reduction-changes branch August 18, 2023 18:29
@github-actions
Copy link

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.6dev3=py310ha25a700_42 ran successfully.
Passed: 913
Failed: 87
Skipped: 119

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants