Reduction performance #1364

Merged: 4 commits merged into master from reduction-performance on Sep 5, 2023

Conversation

oleksandr-pavlyk (Contributor) commented Aug 23, 2023

This PR improves the performance of the reduction kernel that uses atomics:

  1. The contiguous ("contig") implementation kernel gets a dedicated name (easier to spot in the output of onetrace).
  2. Increases the work-group size to a multiple of the sub-group size.
  3. Changes the order in which work-groups tile the array: the "along iteration axis" coordinate now moves fastest, where previously the "along reduction axis" coordinate did (see the sketch below).
  4. Introduces a dedicated implementation for reduction over axis 0 and deploys it.
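
For illustration, here is a minimal Python sketch of the tiling-order change in item 3. This is not the actual kernel code; the function names and the (iteration, reduction) decomposition are hypothetical:

    # Hypothetical sketch: map a linear work-group id to an
    # (iteration, reduction) tile position; illustration only.

    def tile_reduction_fastest(linear_id, n_wg_red):
        # old order: the reduction-axis coordinate moves fastest
        return (linear_id // n_wg_red, linear_id % n_wg_red)

    def tile_iteration_fastest(linear_id, n_wg_iter):
        # new order: the iteration-axis coordinate moves fastest
        return (linear_id % n_wg_iter, linear_id // n_wg_iter)

With the new order, consecutively scheduled work-groups land on different iteration-axis positions, which plausibly lowers contention on each atomically updated output.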

Performance in the main trunk:

    In [1]: import dpctl.tensor as dpt

    In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(\
                           dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

    In [3]: %time y = dpt.sum(x, axis=0)
    CPU times: user 309 ms, sys: 128 ms, total: 437 ms
    Wall time: 473 ms

    In [4]: %time y = dpt.sum(x, axis=0)
    CPU times: user 132 ms, sys: 160 ms, total: 292 ms
    Wall time: 316 ms

    In [5]: %time y = dpt.sum(x, axis=0)
    CPU times: user 104 ms, sys: 185 ms, total: 289 ms
    Wall time: 312 ms

After this change:

    In [1]: import dpctl.tensor as dpt

    In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(\
                           dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

    In [3]: %time y = dpt.sum(x, axis=0)
    CPU times: user 136 ms, sys: 9.52 ms, total: 145 ms
    Wall time: 158 ms

    In [4]: %time y = dpt.sum(x, axis=0)
    CPU times: user 18.8 ms, sys: 17.3 ms, total: 36.1 ms
    Wall time: 42 ms

    In [5]: %time y = dpt.sum(x, axis=0)
    CPU times: user 19.2 ms, sys: 16.9 ms, total: 36.1 ms
    Wall time: 38.4 ms

    In [6]: %time y = dpt.sum(x, axis=0)
    CPU times: user 1.69 ms, sys: 35.2 ms, total: 36.9 ms
    Wall time: 39.4 ms

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

coveralls (Collaborator) commented Aug 23, 2023

Coverage: 85.635%. Remained the same when pulling 32d4419 on reduction-performance into 9f98baf on master.

@github-actions

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_14 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

@github-actions

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_24 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

1. Contig implementation kernel gets a dedicated name
   (easier to spot in the output of onetrace)
2. Increase the work-group size to a multiple of the sub-group size
3. Change the order in which work-groups tile the array
   from 'along reduction axis' moves fastest to
   'along iteration axis' moves fastest.

This last change contributes to a significant performance improvement:

```
================= Before change

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 309 ms, sys: 128 ms, total: 437 ms
Wall time: 473 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 132 ms, sys: 160 ms, total: 292 ms
Wall time: 316 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 104 ms, sys: 185 ms, total: 289 ms
Wall time: 312 ms
```

```
===== After change

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 150 ms, sys: 32.9 ms, total: 183 ms
Wall time: 198 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 20 ms, sys: 22.7 ms, total: 42.7 ms
Wall time: 49.4 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.2 ms, sys: 28.9 ms, total: 39.1 ms
Wall time: 41.4 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 23 ms, sys: 18 ms, total: 41 ms
Wall time: 43.5 ms
```
This achieves additional savings over the prior commit:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 136 ms, sys: 9.52 ms, total: 145 ms
Wall time: 158 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 18.8 ms, sys: 17.3 ms, total: 36.1 ms
Wall time: 42 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 19.2 ms, sys: 16.9 ms, total: 36.1 ms
Wall time: 38.4 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.69 ms, sys: 35.2 ms, total: 36.9 ms
Wall time: 39.4 ms

In [7]: quit
```

Prior to this, the wall time stood at 49 ms.
The old logic was misguided: it was based on the idea that if
using the maximum work-group size can lead to launching just a
single work-group, then everything can be reduced within that
work-group without using atomics at all.

This led to problems on CPU, where the maximum work-group size
is 8192: the maximum size was selected, yet the total number of
work-groups launched was still high due to the large iteration
space, which resulted in severe underutilization of the device
(low occupancy).
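
As a rough illustration of the occupancy problem (max_wg = 8192 is the CPU maximum cited above; the reduction length and sub-group size are assumed values):

```python
# Illustrative occupancy arithmetic; max_wg = 8192 is the CPU maximum
# work-group size cited above, the other numbers are assumptions.
max_wg = 8192
reduction_size = 128   # assumed length of the reduction axis
sub_group = 32         # assumed sub-group size

# Old choice: the reduction fits into one work-group, so the maximum
# size was selected to avoid atomics; only reduction_size of the
# max_wg work-items have anything to reduce.
busy_fraction = reduction_size / max_wg   # 128 / 8192, about 1.6%

# Sizing the group to the smallest sub-group multiple that covers the
# reduction keeps nearly every work-item busy.
better_wg = -(-reduction_size // sub_group) * sub_group   # 128
print(f"busy at wg={max_wg}: {busy_fraction:.1%}; better wg: {better_wg}")
```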
Made changes similar to those made in the kernels for atomic
reduction: work-groups' locations now change fastest along the
iteration dimension (previously they changed fastest along the
reduction dimension).

With this change, reduction performance increases 7-8x:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms

In [5]: quit
```

While in the main branch:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms

In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms

In [6]: quit
```
oleksandr-pavlyk marked this pull request as ready for review August 29, 2023 13:12
@github-actions

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_39 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

oleksandr-pavlyk (Contributor, Author)

Ping.

ndgrigorian (Collaborator)

@oleksandr-pavlyk
I wasn't sure if you were planning to add more to this PR.

If not, I'll go ahead and do a last look-over and approve it. I tested it out last week and found no correctness issues and the performance is much better.

oleksandr-pavlyk merged commit 62e38de into master on Sep 5, 2023
oleksandr-pavlyk deleted the reduction-performance branch September 5, 2023 19:18

github-actions bot commented Sep 5, 2023

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
