Development milestone 0.14.6dev4 #1354

Merged 44 commits on Aug 18, 2023

oleksandr-pavlyk and others added 30 commits July 28, 2023 06:14
This improves performance roughly 8-fold:

```
In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones((4096, 4096), dtype="f4")

In [3]: y = dpt.sum(x, axis=0)

In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 2.64 ms, sys: 4.4 ms, total: 7.04 ms
Wall time: 10 ms

In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.93 ms, sys: 3.22 ms, total: 5.16 ms
Wall time: 4.74 ms

In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.7 ms, sys: 2.83 ms, total: 4.53 ms
Wall time: 4.1 ms

In [7]: %time y = dpt.sum(x, axis=0)
CPU times: user 1.98 ms, sys: 3.3 ms, total: 5.28 ms
Wall time: 4.7 ms
```

The timing before was around 38 ms.
- Adjusted to reduce branching and hopefully improve vectorization of the loop by removing a conditional
1. Removed unused usm_ndarray._clone static C-only method
2. Removed _dispatch* utilities
3. Used direct calls to unary/binary operators in implementation
   of special methods
Provide private method cabs implementing abs for complex types, paying
attention to the special values mandated by the array API.

To work around gh-1279, use std::hypot to compute the value for finite inputs.
Compile with -DUSE_STD_ABS_FOR_COMPLEX_TYPES to use std::abs(z) instead of
std::hypot(std::real(z), std::imag(z)).
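
The special-value ordering described above can be illustrated in Python. This is only a sketch, not the C++ code from the commit: the function name `cabs` is reused for readability, and the branch order follows the array API rule that an infinite component wins over a NaN component.

```python
import math


def cabs(z: complex) -> float:
    """Magnitude of z with explicit special-value handling (illustrative sketch)."""
    x, y = z.real, z.imag
    # An infinite component dominates, even when the other component is NaN.
    if math.isinf(x) or math.isinf(y):
        return math.inf
    # Otherwise any NaN component makes the result NaN.
    if math.isnan(x) or math.isnan(y):
        return math.nan
    # Finite inputs: hypot avoids overflow in the intermediate x*x + y*y.
    return math.hypot(x, y)


print(cabs(complex(3.0, 4.0)))            # 5.0
print(cabs(complex(math.inf, math.nan)))  # inf
print(cabs(complex(1e200, 1e200)))        # finite, although x*x would overflow
```

Python's own `math.hypot` already follows the same special-value rules, so the explicit branches are there only to make the ordering visible.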
This change provides private method csqrt to evaluate square-root
for complex types. It handles special values as mandated by array API.

For finite input, it provides its own implementation based on std::hypot
and std::sqrt for real types instead of calling std::sqrt on finite
input of complex type.

Compile with -DUSE_STD_SQRT_FOR_COMPLEX_TYPES to use std::sqrt instead
of custom implementation.

A cursory performance study suggests that the custom implementation is
no slower than the std::sqrt one.
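
A complex square root built only from a real hypot and a real sqrt might look like the sketch below. It uses the textbook half-angle formula and is an assumption about the general approach, not a transcription of the commit's C++ code. The name `csqrt_finite` is hypothetical, and the sketch does no rescaling, so it can overflow for inputs near the top of the floating-point range.

```python
import cmath
import math


def csqrt_finite(z: complex) -> complex:
    """Principal square root of a finite complex number (no rescaling)."""
    x, y = z.real, z.imag
    if x == 0.0 and y == 0.0:
        # Keep the sign of the imaginary zero, as cmath.sqrt does.
        return complex(0.0, y)
    # t is the larger-magnitude component of the result.
    t = math.sqrt((abs(x) + math.hypot(x, y)) / 2.0)
    if x >= 0.0:
        return complex(t, y / (2.0 * t))
    # Negative real part: t becomes the imaginary part, signed like y.
    return complex(abs(y) / (2.0 * t), math.copysign(t, y))


# Compare against the standard library on a few finite inputs.
for z in (3 + 4j, -3 + 4j, complex(-4.0, -0.0), 1e-3 - 2j):
    print(z, csqrt_finite(z), cmath.sqrt(z))
```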
This utility function is based on a symmetric check, unlike numpy.allclose,
and verifies that abs(x1-x2) < atol + rtol * max(abs(x1), abs(x2)).

This way allclose(x1, x2) is symmetric, and allclose(x1,x2) implies
allclose(x2, x1).
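
The difference between the symmetric check and a NumPy-style check can be seen on scalars. The helper names below are hypothetical. The second predicate follows NumPy's documented formula `abs(a - b) <= atol + rtol * abs(b)`, and the tolerance values are chosen only to expose the asymmetry.

```python
def close_symmetric(a, b, rtol, atol):
    # Tolerance scales with the larger magnitude, so argument order is irrelevant.
    return abs(a - b) < atol + rtol * max(abs(a), abs(b))


def close_reference(a, b, rtol, atol):
    # NumPy-style check: tolerance scales with the second argument only.
    return abs(a - b) <= atol + rtol * abs(b)


a, b, rtol, atol = 1.0, 1.1, 0.095, 0.0
print(close_reference(a, b, rtol, atol), close_reference(b, a, rtol, atol))  # True False
print(close_symmetric(a, b, rtol, atol), close_symmetric(b, a, rtol, atol))  # True True
```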
intel/llvm/pull/10551 has been merged, so the build should succeed
and produce a working binary.

The intel/llvm project has transitioned from sycl-nightly/YYYYMMDD tags to
nightly-YYYY-MM-DD tags instead.

The intel/llvm nightly build artifact has also changed its name
and structure. The code is adjusted accordingly.
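
For scripts that still hold old-style tag names, the renaming described above could be handled with a small helper along these lines. The function name is hypothetical and the date in the usage line is made up for illustration. Only the two tag patterns come from the commit message.

```python
import re


def to_new_nightly_tag(old_tag: str) -> str:
    """Map 'sycl-nightly/YYYYMMDD' to 'nightly-YYYY-MM-DD'."""
    m = re.fullmatch(r"sycl-nightly/(\d{4})(\d{2})(\d{2})", old_tag)
    if m is None:
        raise ValueError(f"not an old-style nightly tag: {old_tag!r}")
    year, month, day = m.groups()
    return f"nightly-{year}-{month}-{day}"


# Illustrative date only, not a tag referenced by this PR.
print(to_new_nightly_tag("sycl-nightly/20230801"))  # nightly-2023-08-01
```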
…bundle

Use latest sycl bundle to build DPCTL
test_sycl_queue.py::test_cython_api requires a compiler to build
a native extension.
Adds a simple C extension, compiled with a C compiler, that includes the
dpctl_capi header file. This mimics the use of dpctl_capi in numba_dpex.
1. Aligned default values with those of np.allclose
2. Replaced less test with less_equal to align with NumPy.
Also added tests for early exits to improve coverage.
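
The practical effect of switching from a strict to an inclusive comparison shows up when both tolerances are zero. The sketch below keeps the symmetric `max` form from the earlier commit, which is an assumption about the final utility. The defaults `rtol=1e-5` and `atol=1e-8` are the documented defaults of np.allclose, and both helper names are hypothetical.

```python
def close_lt(a, b, rtol=1e-5, atol=1e-8):
    # Strict comparison: identical values fail when both tolerances are zero.
    return abs(a - b) < atol + rtol * max(abs(a), abs(b))


def close_le(a, b, rtol=1e-5, atol=1e-8):
    # Inclusive comparison: identical values always pass.
    return abs(a - b) <= atol + rtol * max(abs(a), abs(b))


print(close_lt(2.0, 2.0, rtol=0.0, atol=0.0))  # False
print(close_le(2.0, 2.0, rtol=0.0, atol=0.0))  # True
```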
This change builds upon gh-1265 and takes into account the queue
from the pre-allocated buffer, if provided.
dpctl.tensor.asarray implementation of order='K' processing
was replaced with tested _empty_like_orderK utility to fix
the issue reported in gh-1350.

A few routines had to be shuffled to avoid import failure due to
circular import dependencies.
oleksandr-pavlyk and others added 14 commits August 17, 2023 03:57
This improves accuracy at extremes of supported range.

Use ldexp and ilogb from the sycl:: namespace to prevent problems
with VS 2017 headers.
Fix bad order=K code logic in tensor.asarray
Reworked text based on PR feedback.
Reworked text based on PR feedback
Reworked text based on PR feedback
ilogb would have to pay attention to correctly computing
scale of denormal floats, while simpler code suffices.

Also use unscaled version in most cases, and scaled version
only for very large inputs.
We work around issues with these functions when their implementation
is taken from VS 2017 headers on Windows though.
Update README for wheel installation
Improvement to performance of tensor.sum
* Where result now keeps order of operands
- Now when operands are cast, stride simplification can still be performed on non-C contiguous inputs
- Implements _empty_like_triple_orderK to allocate output of where

* Adds test for correct order="K" behavior in where

* Adjusted logic in _empty_like_triple_orderK
- Now calls _empty_like_pair_orderK when two arrays are of equal shape and larger than the third

* Changes to order "K" stride sorting
- Dimensions of size 1 are effectively disregarded in sorting

* Fixed typo in _empty_like_orderK
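
The idea of order "K" stride sorting with size-1 dimensions disregarded can be sketched in plain Python. This is an illustration under stated assumptions, not the body of `_empty_like_orderK`: the function name `order_k_strides` is hypothetical, strides are counted in elements, and size-1 axes are given sort key 0 so they fall to the end of the memory order.

```python
def order_k_strides(shape, strides):
    """Strides of a fresh dense array that mimics the memory order of the input."""
    ndim = len(shape)

    def key(d):
        # A size-1 axis never affects addressing, so its stride is ignored.
        return abs(strides[d]) if shape[d] > 1 else 0

    # Axes listed from slowest-varying to fastest-varying in memory.
    perm = sorted(range(ndim), key=key, reverse=True)
    out = [0] * ndim
    step = 1
    # Assign dense strides starting from the fastest-varying axis.
    for d in reversed(perm):
        out[d] = step
        step *= shape[d]
    return tuple(out)


print(order_k_strides((2, 3), (1, 2)))          # (1, 2): F-contiguous stays F
print(order_k_strides((2, 3), (3, 1)))          # (3, 1): C-contiguous stays C
print(order_k_strides((2, 1, 3), (3, 100, 1)))  # (3, 1, 1): odd stride on size-1 axis ignored
```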
* Binary elementwise functions can now act on any input in-place
- A temporary will be allocated as necessary (i.e., when arrays overlap, are not going to be cast, and are not the same logical arrays)
- Uses dedicated in-place kernels where they are implemented
- Now called directly by Python operators
- Removes _inplace method of BinaryElementwiseFunc class
- Removes _find_inplace_dtype function

* Tests for new out parameter behavior for add

* Broadcasting made conditional in binary functions where memory overlap is possible
- Broadcasting can change the values of strides without changing array shape

* Changed exception types raised

Use ExecutionPlacementError for CFD violations.
Use ValueError if types of input are as expected, but values are
not as expected.

* Adding tests to improve coverage

Removed tests expecting error raised in case of overlapping inputs.
Added tests guided by coverage report.

* Removed provably unreachable branches in _resolve_weak_types

Since o1_dtype_kind_num > o2_dtype_kind_num, o1 cannot be a
weak boolean type, since that type has the lowest kind number in the
hierarchy.

* All in-place operators now use call operator of BinaryElementwiseFunc

* Removed some redundant and obsolete tests
- Removed from test_floor_ceil_trunc, test_hyperbolic, test_trigonometric, and test_logaddexp
- These tests would fail on GPU but never run on CPU, and therefore were not impacting the coverage
- These tests focused on aspects of the BinaryElementwiseFunc class rather than the behavior of the operator

---------

Co-authored-by: Oleksandr Pavlyk <oleksandr.pavlyk@intel.com>
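
Why an in-place binary operation on overlapping arrays may need a temporary, as the commit above describes, can be shown with a flat Python list standing in for device memory. Both helper names are hypothetical, and the sketch says nothing about how dpctl detects overlap or allocates its temporary.

```python
def add_inplace_naive(buf, dst_start, src_start, n):
    # Reads and writes the same buffer element by element.
    for i in range(n):
        buf[dst_start + i] += buf[src_start + i]


def add_inplace_with_temp(buf, dst_start, src_start, n):
    # Snapshot the source first, so later reads do not see earlier writes.
    tmp = buf[src_start:src_start + n]
    for i in range(n):
        buf[dst_start + i] += tmp[i]


a = [1, 2, 3, 4]
add_inplace_naive(a, 1, 0, 3)      # conceptually a[1:] += a[:-1]
b = [1, 2, 3, 4]
add_inplace_with_temp(b, 1, 0, 3)
print(a)  # [1, 3, 6, 10]: each write leaked into the next read
print(b)  # [1, 3, 5, 7]: the expected elementwise result
```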
@github-actions

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

@github-actions

Array API standard conformance tests for dpctl=0.14.6dev4=py310ha25a700_33 ran successfully.
Passed: 913
Failed: 87
Skipped: 119

4 participants