Some alignment optimizations #7382
Conversation
This may happen in some (rare?) cases where the objects to align share the same indexes.
If all unindexed dimension sizes match the indexed dimension sizes in the objects to align, we don't need re-indexing.
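The fast path described above can be sketched as a standalone check. The function names, argument shapes, and the use of plain Python objects as stand-in indexes below are all illustrative assumptions, not xarray's actual internals:

```python
def identical_index_objects(indexes):
    # Identity (``is``) comparison: cheap, and sufficient when the objects
    # to align literally share the same index instances.
    if not indexes:
        return True
    first = indexes[0]
    return all(idx is first for idx in indexes[1:])


def can_skip_reindex(indexes_per_dim, unindexed_sizes, indexed_sizes):
    # Hypothetical fast path: re-indexing is a no-op when every dimension's
    # index objects are identical across the objects to align, and every
    # unindexed dimension size agrees with the corresponding indexed size.
    return all(
        identical_index_objects(ixs) for ixs in indexes_per_dim
    ) and all(
        indexed_sizes.get(dim, size) == size
        for dim, size in unindexed_sizes.items()
    )


# plain objects stand in for index instances
idx = object()
print(can_skip_reindex([[idx, idx]], {"x": 3}, {"x": 3}))  # True: same objects, matching sizes
print(can_skip_reindex([[idx, object()]], {}, {}))         # False: distinct index objects
```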
Quick benchmark taking the example in #7376 (it seems even much faster than in version 2022.3.0!):

```python
# version 2022.3.0
%timeit ds.assign(foo=~ds["d3"])
# 22.5 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# main branch
%timeit ds.assign(foo=~ds["d3"])
# 193 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR
%timeit ds.assign(foo=~ds["d3"])
# 1.01 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
No benchmark is catching this? Maybe we can add a small one in https://github.com/pydata/xarray/blob/main/asv_bench/benchmarks/indexing.py?
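A minimal ASV benchmark along those lines might look like the sketch below; the class name, dataset shape, and sizes are made up for illustration and are not the benchmark actually added to `asv_bench/benchmarks/indexing.py`:

```python
import numpy as np
import xarray as xr


class AssignSharedIndexes:
    """Hypothetical ASV benchmark: assign a result derived from the same
    dataset, so all objects to align share identical index objects."""

    def setup(self):
        n = 1000
        self.ds = xr.Dataset(
            {"d3": ("x", np.random.rand(n) > 0.5)},
            coords={"x": np.arange(n)},
        )

    def time_assign_same_indexes(self):
        # alignment should hit the fast path: both operands share ds's index
        self.ds.assign(foo=~self.ds["d3"])
```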
```diff
@@ -1419,6 +1419,11 @@ def check_variables():
             )

+        indexes = [e[0] for e in elements]

+        same_objects = all(indexes[0] is other_idx for other_idx in indexes[1:])
```
Suggested change:

```diff
-same_objects = all(indexes[0] is other_idx for other_idx in indexes[1:])
+indexes_0 = indexes[0]
+same_objects = all(indexes_0 is other_idx for other_idx in indexes[1:])
```
No need to use getitem several times for the same value. A similar change could be made in other places in the function as well.
List indexing is O(1), so should be only a marginal speedup :P
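As a sanity check on that claim, a quick `timeit` comparison (illustrative, not from the PR) of the two spellings shows both loops are dominated by generator overhead, so hoisting the O(1) list lookup is typically only marginally faster:

```python
import timeit

indexes = [object()] * 10_000  # 10,000 references to the same object


def repeated_lookup():
    # indexes[0] is re-evaluated on every iteration of the generator
    return all(indexes[0] is other_idx for other_idx in indexes[1:])


def hoisted_lookup():
    # the O(1) list lookup is done once, outside the loop
    indexes_0 = indexes[0]
    return all(indexes_0 is other_idx for other_idx in indexes[1:])


t_repeated = timeit.timeit(repeated_lookup, number=100)
t_hoisted = timeit.timeit(hoisted_lookup, number=100)
print(f"repeated: {t_repeated:.4f}s  hoisted: {t_hoisted:.4f}s")
```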
Yeah I think the gain in perf would be negligible.
I don't know if the optimizations added here will benefit a large set of use cases (it took 6 months before an issue report surfaced), but it is worth it for at least a few of them. This is ready, I think (I added some benchmarks).
Thanks, @benbovy!
* main: (41 commits)
  - v2023.01.0 whats-new (pydata#7440)
  - explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  - Add sentence to open_dataset docstring (pydata#7438)
  - pin scipy version in doc environment (pydata#7436)
  - Improve performance for backend datetime handling (pydata#7374)
  - fix typo (pydata#7433)
  - Add lazy backend ASV test (pydata#7426)
  - Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  - see also : groupby in resample doc and vice-versa (pydata#7425)
  - Some alignment optimizations (pydata#7382)
  - Make `broadcast` and `concat` work with the Array API (pydata#7387)
  - remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  - [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  - Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  - [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  - [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  - COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  - Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  - `keep_attrs` for pad (pydata#7267)
  - Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  - ...
whats-new.rst
May fix some performance regressions, e.g., see #7376 (comment).
@ravwojdyla with this PR, `ds.assign(foo=~ds["d3"])` in your example should be much faster (on par with version 2022.3.0).