Reduction performance #1364
Conversation
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1364/index.html
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_14 ran successfully.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_24 ran successfully.
1. The contiguous-implementation kernel gets a dedicated name (easier to spot in the output of onetrace).
2. Increase the work-group multiple.
3. Change the order in which work-groups tile the array, from 'along the reduction axis' moving fastest to 'along the iteration axis' moving fastest.

The last change contributes a significant performance improvement (see the sketch after the timings below):

```
================= Before change
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 309 ms, sys: 128 ms, total: 437 ms
Wall time: 473 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 132 ms, sys: 160 ms, total: 292 ms
Wall time: 316 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 104 ms, sys: 185 ms, total: 289 ms
Wall time: 312 ms
```

```
===== After change
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 150 ms, sys: 32.9 ms, total: 183 ms
Wall time: 198 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 20 ms, sys: 22.7 ms, total: 42.7 ms
Wall time: 49.4 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.2 ms, sys: 28.9 ms, total: 39.1 ms
Wall time: 41.4 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 23 ms, sys: 18 ms, total: 41 ms
Wall time: 43.5 ms
```
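Item 3 is easiest to see as index arithmetic. Below is a minimal Python sketch of the two mappings from a flat work-group id to (iteration, reduction) work-group coordinates; the helper names are made up for illustration, and the actual kernels are SYCL C++ inside dpctl:

```python
# Hypothetical sketch of the work-group tiling order change; not dpctl code.

def wg_coords_reduction_fastest(wg_id, n_red_wgs):
    # Before: consecutive work-group ids advance along the reduction axis,
    # so concurrently scheduled groups contend for the same output element.
    iter_idx, red_idx = divmod(wg_id, n_red_wgs)
    return iter_idx, red_idx

def wg_coords_iteration_fastest(wg_id, n_iter_wgs):
    # After: consecutive work-group ids advance along the iteration axis,
    # so concurrently scheduled groups update distinct output elements,
    # spreading out atomic contention on the result array.
    red_idx, iter_idx = divmod(wg_id, n_iter_wgs)
    return iter_idx, red_idx
```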
This achieves additional savings over the prior commit:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 136 ms, sys: 9.52 ms, total: 145 ms
Wall time: 158 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 18.8 ms, sys: 17.3 ms, total: 36.1 ms
Wall time: 42 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 19.2 ms, sys: 16.9 ms, total: 36.1 ms
Wall time: 38.4 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.69 ms, sys: 35.2 ms, total: 36.9 ms
Wall time: 39.4 ms

In [7]: quit
```

Prior to this change, the wall time stood at 49 ms.
The logic was misguided. It was based on the idea that if using the maximum work-group size can lead to launching just a single work-group per reduction, then everything can be reduced within that work-group without using atomics at all. This led to problems on CPU, where the maximum work-group size is 8192: the maximum size was selected, yet the total number of work-groups launched was still high because of the large iteration space, resulting in severe underutilization of the device (low occupancy).
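To make the underutilization concrete, here is a back-of-the-envelope sketch of one plausible reading of the comment above. The 8192 figure is the CPU maximum mentioned there; the other numbers are assumed for illustration, not measured:

```python
# Hypothetical numbers illustrating the flaw in the old heuristic.
max_wg_size = 8192       # max work-group size reported by a CPU device
reduction_size = 128     # elements to reduce per output element
iteration_size = 10**6   # independent output elements (iteration space)

# Old heuristic: make the work-group wide enough to hold the entire
# reduction in one group, so no atomics are needed.
wg_size = max_wg_size
busy_fraction = reduction_size / wg_size  # 128/8192 ~ 1.6% of items busy
n_work_groups = iteration_size            # one mostly idle group per output

print(f"{n_work_groups} groups launched, {busy_fraction:.1%} occupancy each")
```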
Made changes similar to those made in the kernels for atomic reduction: work-groups now change location along the iteration dimension the fastest (previously, along the reduction dimension the fastest). Due to this change, reduction performance increases 7-8x:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms

In [5]: quit
```

While in the main branch:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms

In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms

In [6]: quit
```
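A quick way to sanity-check that the reordering does not change results is to compare against a host-side NumPy reference. This is a sketch using the f4 variant of the benchmark from earlier in the thread; the tolerance is an assumption chosen for single-precision accumulation:

```python
import numpy as np
import dpctl.tensor as dpt

# Same input as in the timings above: 1/n**2 laid out as (1282200, 128).
x = dpt.reshape(
    dpt.asarray(1, dtype="f4")
    / dpt.square(dpt.arange(1, 1282200 * 128 + 1, dtype="f4")),
    (1282200, 128),
)
y = dpt.sum(x, axis=0)

# Host reference accumulated in double precision to bound rounding error.
ref = dpt.asnumpy(x).sum(axis=0, dtype=np.float64)
assert np.allclose(dpt.asnumpy(y), ref, rtol=1e-4), "reduction mismatch"
```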
Force-pushed from 39c8bd9 to 32d4419.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_39 ran successfully.
Ping.
@oleksandr-pavlyk If not, I'll go ahead and do a last look-over and approve it. I tested it out last week and found no correctness issues, and the performance is much better.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
This PR improves the performance of the reduction kernel that uses atomics.
Performance in the main trunk:
After this change: