Reduction performance #1364
Conversation
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1364/index.html
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_14 ran successfully.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_24 ran successfully.
1. The contiguous-implementation kernel gets a dedicated name (easier to spot in the output of onetrace).
2. Increase the work-group multiple.
3. Change the order in which work-groups tile the array, from 'along the reduction axis' moving fastest to 'along the iteration axis' moving fastest.

The last change contributes a significant performance improvement (see the sketch after the timings below):

```
================= Before change
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 309 ms, sys: 128 ms, total: 437 ms
Wall time: 473 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 132 ms, sys: 160 ms, total: 292 ms
Wall time: 316 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 104 ms, sys: 185 ms, total: 289 ms
Wall time: 312 ms
```

```
===== After change
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 150 ms, sys: 32.9 ms, total: 183 ms
Wall time: 198 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 20 ms, sys: 22.7 ms, total: 42.7 ms
Wall time: 49.4 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.2 ms, sys: 28.9 ms, total: 39.1 ms
Wall time: 41.4 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 23 ms, sys: 18 ms, total: 41 ms
Wall time: 43.5 ms
```
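Item 3 is easiest to see as index arithmetic. Below is a minimal Python sketch of the two mappings from a flat work-group id to (iteration, reduction) work-group coordinates; the helper names are made up for illustration, and the actual kernels are SYCL C++ inside dpctl:

```python
# Hypothetical sketch of the work-group tiling order change; not dpctl code.

def wg_coords_reduction_fastest(wg_id, n_red_wgs):
    # Before: consecutive work-group ids advance along the reduction axis,
    # so concurrently scheduled groups contend for the same output element.
    iter_idx, red_idx = divmod(wg_id, n_red_wgs)
    return iter_idx, red_idx

def wg_coords_iteration_fastest(wg_id, n_iter_wgs):
    # After: consecutive work-group ids advance along the iteration axis,
    # so concurrently scheduled groups update distinct output elements,
    # spreading out atomic contention on the result array.
    red_idx, iter_idx = divmod(wg_id, n_iter_wgs)
    return iter_idx, red_idx
```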
This achieves additional savings over the prior commit:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 136 ms, sys: 9.52 ms, total: 145 ms
Wall time: 158 ms

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 18.8 ms, sys: 17.3 ms, total: 36.1 ms
Wall time: 42 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 19.2 ms, sys: 16.9 ms, total: 36.1 ms
Wall time: 38.4 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.69 ms, sys: 35.2 ms, total: 36.9 ms
Wall time: 39.4 ms

In [7]: quit
```

Prior to this change, the wall time stood at 49 ms.
The logic was misguided. It was based on the idea that if using the maximum work-group size can lead to launching just a single work-group per reduction, then everything can be reduced within that work-group without using atomics at all. This led to problems on CPU, where the maximum work-group size is 8192: the maximum size was selected, yet the total number of work-groups launched was still high because of the large iteration space, resulting in severe underutilization of the device (low occupancy).
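To make the underutilization concrete, here is a back-of-the-envelope sketch of one plausible reading of the comment above. The 8192 figure is the CPU maximum mentioned there; the other numbers are assumed for illustration, not measured:

```python
# Hypothetical numbers illustrating the flaw in the old heuristic.
max_wg_size = 8192       # max work-group size reported by a CPU device
reduction_size = 128     # elements to reduce per output element
iteration_size = 10**6   # independent output elements (iteration space)

# Old heuristic: make the work-group wide enough to hold the entire
# reduction in one group, so no atomics are needed.
wg_size = max_wg_size
busy_fraction = reduction_size / wg_size  # 128/8192 ~ 1.6% of items busy
n_work_groups = iteration_size            # one mostly idle group per output

print(f"{n_work_groups} groups launched, {busy_fraction:.1%} occupancy each")
```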
Made changes similar to those made in the kernels for atomic reduction: work-groups now change location along the iteration dimension the fastest (previously, along the reduction dimension the fastest). Due to this change, reduction performance increases 7-8x:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms

In [5]: quit
```

While in the main branch:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))

In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms

In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms

In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms

In [6]: quit
```
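A quick way to sanity-check that the reordering does not change results is to compare against a host-side NumPy reference. This is a sketch using the f4 variant of the benchmark from earlier in the thread; the tolerance is an assumption chosen for single-precision accumulation:

```python
import numpy as np
import dpctl.tensor as dpt

# Same input as in the timings above: 1/n**2 laid out as (1282200, 128).
x = dpt.reshape(
    dpt.asarray(1, dtype="f4")
    / dpt.square(dpt.arange(1, 1282200 * 128 + 1, dtype="f4")),
    (1282200, 128),
)
y = dpt.sum(x, axis=0)

# Host reference accumulated in double precision to bound rounding error.
ref = dpt.asnumpy(x).sum(axis=0, dtype=np.float64)
assert np.allclose(dpt.asnumpy(y), ref, rtol=1e-4), "reduction mismatch"
```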
Force-pushed from 39c8bd9 to 32d4419.
Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_39 ran successfully.
Ping.
@oleksandr-pavlyk If not, I'll go ahead and do a last look-over and approve it. I tested it out last week and found no correctness issues, and the performance is much better.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
This PR improves the performance of the reduction kernel that uses atomics.
Performance in the main trunk:
After this change: