Change WG traversal pattern in tree reduction kernel
Made changes similar to those made in the kernels for atomic
reduction. The work-group (WG) location now changes fastest along
the iteration dimension (previously it changed fastest along the
reduction dimension).
With this change, reduction performance improves 7-8x:
```
In [1]: import dpctl.tensor as dpt
In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))
In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 284 ms, sys: 3.68 ms, total: 287 ms
Wall time: 316 ms
In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 18.6 ms, sys: 18.9 ms, total: 37.5 ms
Wall time: 43 ms
In [5]: quit
```
For comparison, in the main branch:
```
In [1]: import dpctl.tensor as dpt
In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f2")), (1282200, 128))
In [3]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 440 ms, sys: 129 ms, total: 569 ms
Wall time: 514 ms
In [4]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 159 ms, total: 301 ms
Wall time: 325 ms
In [5]: %time y = dpt.sum(x, axis=0, dtype="f2")
CPU times: user 142 ms, sys: 154 ms, total: 296 ms
Wall time: 325 ms
In [6]: quit
```
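The traversal change described in this commit can be illustrated with a small index-mapping sketch. This is not the kernel code from this commit. The function names and the `n_iter_groups` / `n_reduction_groups` parameters are hypothetical, and the sketch only shows how a flat work-group id could be decomposed so that either the reduction dimension or the iteration dimension varies fastest:
```python
# Hypothetical illustration only: names and parameters are assumptions,
# not identifiers taken from the dpctl sources.

def wg_coords_reduction_fastest(wg_id, n_iter_groups, n_reduction_groups):
    """Previous pattern: consecutive work-group ids step along the reduction dimension."""
    iter_gid = wg_id // n_reduction_groups
    reduction_gid = wg_id % n_reduction_groups
    return iter_gid, reduction_gid


def wg_coords_iteration_fastest(wg_id, n_iter_groups, n_reduction_groups):
    """New pattern: consecutive work-group ids step along the iteration dimension."""
    iter_gid = wg_id % n_iter_groups
    reduction_gid = wg_id // n_iter_groups
    return iter_gid, reduction_gid


n_iter_groups, n_reduction_groups = 4, 3
n_groups = n_iter_groups * n_reduction_groups

old_order = [
    wg_coords_reduction_fastest(i, n_iter_groups, n_reduction_groups)
    for i in range(n_groups)
]
new_order = [
    wg_coords_iteration_fastest(i, n_iter_groups, n_reduction_groups)
    for i in range(n_groups)
]

# Each tuple is (iteration group, reduction group) for work-group ids 0..3.
print(old_order[:4])
print(new_order[:4])
```
Both mappings visit the same set of (iteration, reduction) group pairs; only the order in which consecutive work-group ids reach them differs.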