Avoid accessing slow .data in unstack #5906

Merged
4 commits merged into pydata:main on Oct 29, 2021

Conversation

@TomAugspurger (Contributor) commented on Oct 28, 2021

@github-actions bot (Contributor) commented on Oct 28, 2021

Unit Test Results

6 files, 6 suites, 56m 25s ⏱️
16 278 tests: 14 550 passed ✔️, 1 728 skipped 💤, 0 failed
90 864 runs: 82 692 passed ✔️, 8 172 skipped 💤, 0 failed

Results for commit 6363a76.

♻️ This comment has been updated with latest results.

@TomAugspurger (Contributor, Author) commented:

There are two changes here:

  1. Only check the .data of non-index variables, done at https://github.com/pydata/xarray/pull/5906/files#diff-763e3002fd954d544b05858d8d138b828b66b6a2a0ae3cd58d2040a652f14638R4161-R4163
  2. The check for whether a full index was needed was done inside a `for dim in dims` loop, but the condition doesn't actually depend on `dim`, so I lifted it out of the loop (a rough sketch follows below). This doesn't matter much in practice, since the relevant attributes are cached.

cc @dcherian
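For readers following along, a rough sketch of the shape of the change described in the two points above. This is illustrative only: the helper names, the private `_unstack_once` / `_unstack_full_reindex` methods and their signatures, and the `is_duck_dask_array` import path are assumptions based on the linked diff and the surrounding discussion, not the exact code.

```python
import numpy as np

from xarray.core.pycompat import is_duck_dask_array  # internal helper; import path may differ by version


def unstack_sketch(ds, dims, fill_value, sparse):
    # (1) Only non-index variables are inspected when deciding which unstack path to take.
    nonindexes = [v for name, v in ds.variables.items() if name not in ds.indexes]

    # (2) The condition does not depend on `dim`, so it is evaluated once, outside the loop.
    needs_full_reindex = sparse or any(
        is_duck_dask_array(v.data) or not isinstance(v.data, np.ndarray)
        for v in nonindexes
    )

    result = ds.copy(deep=False)
    for dim in dims:
        if needs_full_reindex:
            result = result._unstack_full_reindex(dim, fill_value, sparse)
        else:
            result = result._unstack_once(dim, fill_value)
    return result
```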

Review thread on xarray/core/dataset.py (outdated, resolved)
@dcherian added the run-benchmark (Run the ASV benchmark workflow) label on Oct 28, 2021
```python
# Dask arrays don't support assignment by index, which the fast unstack
# function requires.
# https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
any(is_duck_dask_array(v.data) for v in nonindexes)
```
Contributor (reviewer) commented:

The same loop is being used here for each of these checks.

  • Would it be better to do them all in one loop?
  • Since they all use any(), I think we can break out of the loop at the first True value.
  • sparse is a constant, so waiting for a loop to finish before checking it seems unnecessary (see the sketch below).
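A minimal sketch of what these bullets suggest, assuming the names from the surrounding diff (nonindexes, sparse, sparse_array_type) and the internal import paths; it is not the actual commit:

```python
import numpy as np

# Assumed imports: these helpers live in xarray's internals around this release,
# but the exact paths are not guaranteed.
from xarray.core.pycompat import is_duck_dask_array, sparse_array_type


def needs_full_reindex(nonindexes, sparse):
    # `or` and any() both short-circuit: the constant `sparse` is tested first,
    # and the generator stops at the first variable matching any condition.
    return sparse or any(
        is_duck_dask_array(v.data)
        or isinstance(v.data, sparse_array_type)
        or not isinstance(v.data, np.ndarray)
        for v in nonindexes
    )
```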

@TomAugspurger (Contributor, Author) commented:

Done in 6363a76 (hopefully I got that right).

Agreed with doing it all in one loop, since I suspect that allocating v.data will be more expensive than any of the checks, so we should avoid doing that for as long as possible.

Contributor (reviewer) commented:

If you use a normal for loop you can remove two of the v.data accesses:

```python
for v in nonindexes:
    data_ = v.data
    ...
```

Might be faster?
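Filling out the ellipsis as a hedged sketch (not the actual change): each variable's .data is read into a local exactly once, and the loop returns at the first match, so it is equivalent to the short-circuiting any() version while avoiding repeated property accesses. Names and import paths are assumed from the surrounding diff.

```python
import numpy as np

# Assumed imports, as above; exact paths are not guaranteed.
from xarray.core.pycompat import is_duck_dask_array, sparse_array_type


def needs_full_reindex(nonindexes, sparse):
    if sparse:
        return True  # constant check first; no .data access needed at all
    for v in nonindexes:
        data = v.data  # read the (potentially expensive) property once per variable
        if (
            is_duck_dask_array(data)
            or isinstance(data, sparse_array_type)
            or not isinstance(data, np.ndarray)
        ):
            return True  # stop at the first match, like a short-circuiting any()
    return False
```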

@TomAugspurger (Contributor, Author) commented:

I think that after the initial access of v.data, which allocates the data if it hasn't been allocated yet, subsequent accesses will be fast. Or is there some case that I'm missing?

Contributor (reviewer) commented:

OK, that may be the case. I've seen, though, that just calling a property can sometimes be quite slow.

@Illviljan added and removed the run-benchmark (Run the ASV benchmark workflow) label on Oct 28, 2021
@dcherian (Contributor) commented:

Nice work! Our benchmarks show a 2X speedup.

@dcherian (Contributor) commented:

Thanks @TomAugspurger

@dcherian merged commit b2ed62e into pydata:main on Oct 29, 2021
@TomAugspurger deleted the fix/5902-unstack-perf branch on October 29, 2021 at 15:29
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 29, 2021
* main:
  Add typing_extensions as a required dependency (pydata#5911)
  pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914)
  Avoid accessing slow .data in unstack (pydata#5906)
  Add wradlib to ecosystem in docs (pydata#5915)
  Use .to_numpy() for quantified facetgrids (pydata#5886)
  [test-upstream] fix pd skipna=None (pydata#5899)
  Add var and std to weighted computations (pydata#5870)
  Check for path-like objects rather than Path type, use os.fspath (pydata#5879)
  Handle single `PathLike` objects in `open_mfdataset()` (pydata#5884)
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 29, 2021
* upstream/main:
  Add typing_extensions as a required dependency (pydata#5911)
  pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914)
  Avoid accessing slow .data in unstack (pydata#5906)
  Add wradlib to ecosystem in docs (pydata#5915)
  Use .to_numpy() for quantified facetgrids (pydata#5886)
  [test-upstream] fix pd skipna=None (pydata#5899)
  Add var and std to weighted computations (pydata#5870)
snowman2 pushed a commit to snowman2/xarray that referenced this pull request Feb 9, 2022
Labels: run-benchmark (Run the ASV benchmark workflow)
3 participants