Development milestone 0.14.6dev4 #1354

oleksandr-pavlyk · 2023-08-18T21:12:07Z

This PR is a development milestone containing the following changes after 0.14.6dev3:

This improves performance about 8-fold:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 2.64 ms, sys: 4.4 ms, total: 7.04 ms
Wall time: 10 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.93 ms, sys: 3.22 ms, total: 5.16 ms
Wall time: 4.74 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.7 ms, sys: 2.83 ms, total: 4.53 ms
Wall time: 4.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.98 ms, sys: 3.3 ms, total: 5.28 ms
Wall time: 4.7 ms
```

The timing before was around 38 ms.

- Adjusted to reduce branching and hopefully improve vectorization of the loop by removing a conditional

1. Removed unused usm_ndarray._clone static C-only method
2. Removed _dispatch* utilities
3. Used direct calls to unary/binary operators in implementation of special methods

Provide cabs private method implementing abs for complex types, paying attention to array-API mandated special values. To work around gh-1279, use std::hypot to compute the value for finite inputs. Compile with -DUSE_STD_ABS_FOR_COMPLEX_TYPES to use std::abs(z) instead of std::hypot(std::real(z), std::imag(z)).
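A minimal Python sketch of the special-value handling described above, for illustration only. The actual cabs is a private C++ method; the function below is an assumption about its logic and is not taken from the dpctl sources. It follows the array API rule that an infinite component yields +infinity even when the other component is NaN, and uses a hypot-based computation for finite inputs.

```python
import math


def cabs(z: complex) -> float:
    """Illustrative abs for a complex value (hypothetical helper, not dpctl code)."""
    re, im = z.real, z.imag
    # Array API: if either component is infinite, the result is +infinity,
    # even if the other component is NaN.
    if math.isinf(re) or math.isinf(im):
        return math.inf
    # Otherwise a NaN component makes the result NaN.
    if math.isnan(re) or math.isnan(im):
        return math.nan
    # Finite inputs: hypot avoids intermediate overflow of re*re + im*im.
    return math.hypot(re, im)


print(cabs(complex(3.0, 4.0)))            # 5.0
print(cabs(complex(math.nan, -math.inf)))  # inf
```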

This change provides private method csqrt to evaluate the square root for complex types. It handles special values as mandated by the array API. For finite input, it provides its own implementation based on std::hypot and std::sqrt for real types instead of calling std::sqrt on finite input of complex type. Compile with -DUSE_STD_SQRT_FOR_COMPLEX_TYPES to use std::sqrt instead of the custom implementation. A cursory performance study suggests that the custom implementation is at least not worse than the std::sqrt one.
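For readers unfamiliar with the approach, here is a sketch in Python of how a complex square root can be built from hypot and a real sqrt for finite inputs. This is the textbook formulation and only an assumption about what csqrt does; special values (NaN, infinities) and the scaling for extreme magnitudes are deliberately left out.

```python
import cmath
import math


def csqrt_finite(z: complex) -> complex:
    """Principal square root of a finite complex value (illustrative sketch only)."""
    x, y = z.real, z.imag
    if x == 0.0 and y == 0.0:
        # Keep the sign of the imaginary zero.
        return complex(0.0, y)
    # t = sqrt((|x| + |z|) / 2) is the larger of the two result components.
    t = math.sqrt((abs(x) + math.hypot(x, y)) / 2.0)
    if x >= 0.0:
        return complex(t, y / (2.0 * t))
    # For x < 0 the roles swap; the imaginary part takes the sign of y.
    return complex(abs(y) / (2.0 * t), math.copysign(t, y))


print(csqrt_finite(3 + 4j))  # (2+1j)
```

The result can be compared against `cmath.sqrt` for moderate inputs, which is how the sketch was sanity-checked.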

This utility function is based on a symmetric check, unlike numpy.allclose, and verifies that abs(x1-x2) < atol + rtol * max(abs(x1), abs(x2)). This way allclose(x1, x2) is symmetric: allclose(x1, x2) implies allclose(x2, x1).
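The difference between the two checks can be seen on scalars. The sketch below uses hypothetical helper names and plain Python floats; it implements the inequality exactly as written in this description next to the NumPy-style check, which scales rtol by the second argument only.

```python
def isclose_symmetric(a, b, rtol, atol):
    # Check as stated in this description: tolerance scales with the larger magnitude.
    return abs(a - b) < atol + rtol * max(abs(a), abs(b))


def isclose_numpy_style(a, b, rtol, atol):
    # NumPy-style check: tolerance scales with the second argument only.
    return abs(a - b) <= atol + rtol * abs(b)


rtol, atol = 0.095, 0.0
print(isclose_numpy_style(10.0, 11.0, rtol, atol))  # True
print(isclose_numpy_style(11.0, 10.0, rtol, atol))  # False
print(isclose_symmetric(10.0, 11.0, rtol, atol))    # True
print(isclose_symmetric(11.0, 10.0, rtol, atol))    # True
```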

intel/llvm/pull/10551 has been merged, so the build should succeed and produce a working binary. The intel/llvm project has transitioned from sycl-nightly/YYYYMMDD tags to nightly-YYYY-MM-DD tags. The artifact of the intel/llvm nightly build has also changed its name and structure. The code is adjusted accordingly.
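As an illustration of the tag scheme change only, a small Python helper could map an old-style tag to the new style. The function name and the example date are hypothetical, and the real workflow changes are not reproduced here.

```python
from datetime import datetime


def to_new_nightly_tag(old_tag: str) -> str:
    """Convert 'sycl-nightly/YYYYMMDD' to 'nightly-YYYY-MM-DD' (hypothetical helper)."""
    prefix = "sycl-nightly/"
    if not old_tag.startswith(prefix):
        raise ValueError(f"unexpected tag format: {old_tag!r}")
    # strptime also validates that the suffix is a real calendar date.
    day = datetime.strptime(old_tag[len(prefix):], "%Y%m%d")
    return day.strftime("nightly-%Y-%m-%d")


# The date below is chosen for illustration and does not refer to a specific build.
print(to_new_nightly_tag("sycl-nightly/20230818"))  # nightly-2023-08-18
```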

…bundle Use latest sycl bundle to build DPCTL

test_sycl_queue.py::test_cython_api requires a compiler to build a native extension.

Adds a simple C extension, compiled with a C compiler, that includes the dpctl_capi header file. This mimics the use of dpctl_capi by numba_dpex.

Correct typo in an exception text

1. Aligned default values with those of np.allclose
2. Replaced less test with less_equal to align with NumPy.

Also added tests for early exits to improve coverage.

Test environment requires compilers

This change builds upon gh-1265 and takes into account the queue from the pre-allocated buffer, if provided.

The dpctl.tensor.asarray implementation of order='K' processing was replaced with the tested _empty_like_orderK utility to fix the issue reported in gh-1350. A few routines had to be shuffled to avoid import failure due to circular import dependencies.

This improves accuracy at the extremes of the supported range. Use sycl:: namespace ldexp and ilogb to prevent problems with VS 2017 headers.

Fix bad order=K code logic in tensor.asarray

Reworked text per PR feedback.

Reworked text based on PR feedback

Reworked text based on PR feedback

ilogb would have to pay attention to correctly computing scale of denormal floats, while simpler code suffices. Also use unscaled version in most cases, and scaled version only for very large inputs.
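The idea of using an unscaled computation in most cases and a scaled one only for very large inputs can be sketched in Python as follows. The threshold, the helper names, and the use of frexp to pick the scale are assumptions made for illustration; the kernel code in this PR is C++ and may differ in every detail. The sketch assumes finite inputs.

```python
import math

# Hypothetical threshold: below it, x*x + y*y cannot overflow in float64.
_LARGE = math.ldexp(1.0, 500)


def naive_hypot(x: float, y: float) -> float:
    # Unscaled version: fine for moderate inputs, overflows for very large ones.
    return math.sqrt(x * x + y * y)


def hypot_scaled_when_large(x: float, y: float) -> float:
    m = max(abs(x), abs(y))
    if m < _LARGE:
        return naive_hypot(x, y)
    # m = f * 2**e with 0.5 <= f < 1, so dividing by 2**e brings inputs below 1.
    _, e = math.frexp(m)
    xs, ys = math.ldexp(x, -e), math.ldexp(y, -e)
    # Scale the result back by the same power of two.
    return math.ldexp(math.sqrt(xs * xs + ys * ys), e)


print(naive_hypot(1e200, 1e200))              # inf
print(hypot_scaled_when_large(3.0, 4.0))      # 5.0
```

Scaling by a power of two is exact in binary floating point, so the scaled path loses no accuracy from the scaling itself.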

We work around issues with these functions when their implementation is taken from VS 2017 headers on Windows though.

Update README for wheel installation

Fix gh-1279, implement tensor.allclose

Improvement to performance of tensor.sum

* Where result now keeps order of operands
  - Now when operands are cast, stride simplification can still be performed on non-C contiguous inputs
  - Implements _empty_like_triple_orderK to allocate output of where
* Adds test for correct order="K" behavior in where
* Adjusted logic in _empty_like_triple_orderK
  - Now calls _empty_like_pair_orderK when two arrays are of equal shape and larger than the third
* Changes to order "K" stride sorting
  - Dimensions of size 1 are effectively disregarded in sorting
* Fixed typo in _empty_like_orderK
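To make the order "K" stride sorting concrete, here is a rough Python sketch of computing dense output strides that preserve the axis ordering of an input while disregarding size-1 dimensions. The function name and the tie-breaking rule are assumptions for illustration; this is not the code of _empty_like_orderK.

```python
def order_k_strides(shape, strides):
    """Dense strides (in elements) that keep the input's axis ordering (sketch)."""

    def key(ax):
        # Size-1 axes get key 0 so their strides never influence the ordering.
        k = abs(strides[ax]) if shape[ax] > 1 else 0
        # On ties, let higher-index axes vary faster (C-like preference).
        return (k, -ax)

    out = [0] * len(shape)
    step = 1
    # Walk axes from fastest-varying to slowest-varying.
    for ax in sorted(range(len(shape)), key=key):
        out[ax] = step
        step *= shape[ax]
    return tuple(out)


print(order_k_strides((2, 3, 4), (1, 2, 6)))     # (1, 2, 6)
print(order_k_strides((2, 1, 4), (4, 1000, 1)))  # (4, 1, 1)
```

In the second call the middle axis has size 1, so its unusual stride of 1000 has no effect on how the other two axes are ordered.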

* Binary elementwise functions can now act on any input in-place
  - A temporary will be allocated as necessary (i.e., when arrays overlap, are not going to be cast, and are not the same logical arrays)
  - Uses dedicated in-place kernels where they are implemented
  - Now called directly by Python operators
  - Removes _inplace method of BinaryElementwiseFunc class
  - Removes _find_inplace_dtype function
* Tests for new out parameter behavior for add
* Broadcasting made conditional in binary functions where memory overlap is possible
  - Broadcasting can change the values of strides without changing array shape
* Changed exception types raised

  Use ExecutionPlacementError for CFD violations. Use ValueError if types of input are as expected, but values are not as expected.
* Adding tests to improve coverage

  Removed tests expecting error raised in case of overlapping inputs. Added tests guided by coverage report.
* Removed provably unreachable branches in _resolve_weak_types

  Since o1_dtype_kind_num > o2_dtype_kind_num, o1 cannot be a weak boolean type, since that type has the lowest kind number in the hierarchy.
* All in-place operators now use call operator of BinaryElementwiseFunc
* Removed some redundant and obsolete tests
  - Removed from test_floor_ceil_trunc, test_hyperbolic, test_trigonometric, and test_logaddexp
  - These tests would fail on GPU but never run on CPU, and therefore were not impacting the coverage
  - These tests focused on aspects of the BinaryElementwiseFunc class rather than the behavior of the operator

---------

Co-authored-by: Oleksandr Pavlyk <oleksandr.pavlyk@intel.com>
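Why a temporary is needed when the output overlaps an input can be shown with plain Python lists. This is only an analogy with hypothetical helper names: the output is the buffer shifted by one element relative to the input, so a naive elementwise loop reads values it has already overwritten.

```python
def shift_add_naive(buf, inc):
    # Input view: buf[:-1]; output view: buf[1:]. The views overlap.
    for i in range(len(buf) - 1):
        # buf[i] may already hold a value written in the previous iteration.
        buf[i + 1] = buf[i] + inc
    return buf


def shift_add_with_temporary(buf, inc):
    # Compute into a temporary first, then copy into the overlapping output.
    tmp = [v + inc for v in buf[:-1]]
    buf[1:] = tmp
    return buf


print(shift_add_naive([0, 0, 0, 0], 1))           # [0, 1, 2, 3]
print(shift_add_with_temporary([0, 0, 0, 0], 1))  # [0, 1, 1, 1]
```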

github-actions · 2023-08-18T21:13:27Z

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

github-actions · 2023-08-18T21:39:45Z

View rendered docs @ https://intelpython.github.io/dpctl/pulls/1354/index.html

github-actions · 2023-08-18T22:23:19Z

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_33 ran successfully.
Passed: 913
Failed: 87
Skipped: 119

github-actions · 2023-08-19T05:46:44Z

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_33 ran successfully.
Passed: 913
Failed: 87
Skipped: 119

oleksandr-pavlyk and others added 30 commits July 28, 2023 06:14

Replaced nd_range<2> with nd_range<1> for non-atomic case as well

2f79acb

Add test based on example from @ndgrigorian's feedback to PR

5c4f980

Boolean reductions transitioned from nd_range<2> to nd_range<1>

a5aee5b

Strided boolean reduction loop tweak

da1cc5b

- Adjusted to reduce branching and hopefully improve vectorization of the loop by removing a conditional

Merge remote-tracking branch 'origin/master' into reduction-changes

63b2799

Specify name for the atomic reduction initialization kernel

9f54428

Clean up of operator special methods

48c2ad2

1. Removed unused usm_ndarray._clone static C-only method
2. Removed _dispatch* utilities
3. Used direct calls to unary/binary operators in implementation of special methods

Removed leftover include iostream

7d9974e

Implements dpctl.tensor.allclose

26862b4

This utility function is based on a symmetric check, unlike numpy.allclose, and verifies that abs(x1-x2) < atol + rtol * max(abs(x1), abs(x2)). This way allclose(x1, x2) is symmetric: allclose(x1, x2) implies allclose(x2, x1).

Adds tests for special FP values for dpt.abs and dpt.sqrt

b121d67

Added tests for type promotion in tensor.allclose

d60d58e

Run on Ubuntu 22.04, since nightly sycl bundle requires newer GLIBC

2dda06c

Copy OpenCL loader from oclcpu to compiler's lib

9056477

Merge pull request #1346 from IntelPython/use-latest-intel-llvm-sycl-…

9afdb58

…bundle Use latest sycl bundle to build DPCTL

Test environment requires compilers

0ed60a9

test_sycl_queue.py::test_cython_api requires a compiler to build a native extension.

Corrected typo in exception text

ee71562

Adding test to prevent gh-1344 from recurring

2256fb3

Adds a simple C extension, compiled with a C compiler, that includes the dpctl_capi header file. This mimics the use of dpctl_capi by numba_dpex.

Updated copyright year range

7f7897c

Merge pull request #1349 from IntelPython/typo_in_exception

02fd974

Correct typo in an exception text

Fixes per PR review feedback

ea6dd27

1. Aligned default values with those of np.allclose
2. Replaced less test with less_equal to align with NumPy.

Adds tests for atol/rtol

f36af57

Also added tests for early exits to improve coverage.

tensor.allclose to use abs(a-b) < max(atol, rtol*max(abs(a), abs(b)))

a75fff8

Merge pull request #1348 from IntelPython/test-requires-compilers

47c82f5

Test environment requires compilers

Completion of fix for gh-1058

ff3c680

This change builds upon gh-1265 and takes into account the queue from the pre-allocated buffer, if provided.

Closes gh-1350

53e850d

The dpctl.tensor.asarray implementation of order='K' processing was replaced with the tested _empty_like_orderK utility to fix the issue reported in gh-1350. A few routines had to be shuffled to avoid import failure due to circular import dependencies.

Added test based on gh-1350

43b9c06

oleksandr-pavlyk and others added 14 commits August 17, 2023 03:57

Scale down arguments and scale back the result

c4312cb

This improves accuracy at the extremes of the supported range. Use sycl:: namespace ldexp and ilogb to prevent problems with VS 2017 headers.

Merge pull request #1351 from IntelPython/fix-gh-1350

9c18949

Fix bad order=K code logic in tensor.asarray

Update README for wheel installation

1c09c66

Update README.md

ec60e9f

Reworked text per PR feedback.

Update README.md

ea83961

Reworked text based on PR feedback

Update README.md

1468978

Reworked text based on PR feedback

Avoid using sycl::ilogb, but use own implementation

bb52bb1

ilogb would have to pay attention to correctly computing scale of denormal floats, while simpler code suffices. Also use unscaled version in most cases, and scaled version only for very large inputs.

Set defines to use std::abs and std::sqrt on Linux

ba9a595

We work around issues with these functions when their implementation is taken from VS 2017 headers on Windows though.

Removed stray include iostream

142190f

Merge pull request #1353 from IntelPython/fix/install

bd996b5

Update README for wheel installation

Merge pull request #1343 from IntelPython/fix-gh-1279

2f3be1f

Fix gh-1279, implement tensor.allclose

Merge pull request #1303 from IntelPython/reduction-changes

b008b8b

Improvement to performance of tensor.sum

oleksandr-pavlyk requested review from antonwolfy, ndgrigorian and xaleryb August 18, 2023 21:12

oleksandr-pavlyk merged commit 5a9d816 into gold/2021 Aug 18, 2023
