Avoid accessing slow .data in unstack #5906

Merged: 4 commits, Oct 29, 2021
Changes shown from 2 commits.
1 change: 1 addition & 0 deletions doc/whats-new.rst
@@ -82,6 +82,7 @@ Bug fixes
  By `Maxime Liquet <https://github.com/maximlt>`_.
- ``open_mfdataset()`` now accepts a single ``pathlib.Path`` object (:issue:`5881`).
  By `Panos Mavrogiorgos <https://github.com/pmav99>`_.
- Improved performance of :py:meth:`Dataset.unstack` (:pull:`5906`).
  By `Tom Augspurger <https://github.com/TomAugspurger>`_.

Documentation
~~~~~~~~~~~~~
58 changes: 31 additions & 27 deletions xarray/core/dataset.py
@@ -4153,34 +4153,38 @@ def unstack(
            )

        result = self.copy(deep=False)
-        for dim in dims:
-
-            if (
-                # Dask arrays don't support assignment by index, which the fast unstack
-                # function requires.
-                # https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
-                any(is_duck_dask_array(v.data) for v in self.variables.values())
-                # Sparse doesn't currently support (though we could special-case
-                # it)
-                # https://github.com/pydata/sparse/issues/422
-                or any(
-                    isinstance(v.data, sparse_array_type)
-                    for v in self.variables.values()
-                )
-                or sparse
-                # Until https://github.com/pydata/xarray/pull/4751 is resolved,
-                # we check explicitly whether it's a numpy array. Once that is
-                # resolved, explicitly exclude pint arrays.
-                # # pint doesn't implement `np.full_like` in a way that's
-                # # currently compatible.
-                # # https://github.com/pydata/xarray/pull/4746#issuecomment-753425173
-                # # or any(
-                # #     isinstance(v.data, pint_array_type) for v in self.variables.values()
-                # # )
-                or any(
-                    not isinstance(v.data, np.ndarray) for v in self.variables.values()
-                )
-            ):
+        # we want to avoid allocating an object-dtype ndarray for a MultiIndex,
+        # so we can't just access self.variables[v].data for every variable.
+        # We only check the non-index variables.
+        # https://github.com/pydata/xarray/issues/5902
+        nonindexes = [
+            self.variables[k] for k in set(self.variables) - set(self.indexes)
+        ]
+        needs_full_reindex = (
+            # Dask arrays don't support assignment by index, which the fast unstack
+            # function requires.
+            # https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
+            any(is_duck_dask_array(v.data) for v in nonindexes)
Contributor (review comment):
The same loop is being run here for each of these checks:

  • Would it be better to do them all in one loop?
  • Since they all use any(), I think we can break out of the loop at the first True value.
  • sparse is a constant; waiting for a loop to finish before checking it seems unnecessary.

Contributor (author):
Done in 6363a76 (hopefully I got that right).

Agreed with doing it all in one loop, since I suspect that allocating v.data will be more expensive than any of the checks, so we should avoid doing that for as long as possible.
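
For reference, a sketch of what "doing it all in one loop" can look like, reusing the names from the diff above (an illustration only, not the verified contents of 6363a76):

    # sparse is a plain bool, so check it before touching any .data;
    # any() short-circuits at the first variable that trips a check.
    needs_full_reindex = sparse or any(
        is_duck_dask_array(v.data)
        or isinstance(v.data, sparse_array_type)
        or not isinstance(v.data, np.ndarray)
        for v in nonindexes
    )

Note that this form still touches v.data up to three times per variable, which the next comment picks up on.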

Contributor:
If you use a normal for loop you can remove two of the v.data accesses:

for v in nonindexes:
    data_ = v.data   # touch .data once per variable
    ...              # run each check against data_

Might be faster?

Contributor (author):
I think that after the initial access of v.data, which allocates the data if it isn't already allocated, subsequent accesses will be fast. Or is there some case that I'm missing?
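
For a plain numpy-backed variable this is easy to sanity-check with a standalone snippet (not part of this PR; lazily loaded or wrapped variables may behave differently, which seems to be the concern below):

    import numpy as np
    import xarray as xr

    v = xr.Variable(("x",), np.arange(3))
    # For an in-memory variable, .data returns the stored array itself,
    # so repeated accesses are cheap and yield the very same object.
    assert v.data is v.data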

Contributor:
OK, that may be the case. I've seen, though, that just calling a property can sometimes be quite slow.

+            # Sparse doesn't currently support (though we could special-case
+            # it)
+            # https://github.com/pydata/sparse/issues/422
+            or any(isinstance(v.data, sparse_array_type) for v in nonindexes)
+            or sparse
+            # Until https://github.com/pydata/xarray/pull/4751 is resolved,
+            # we check explicitly whether it's a numpy array. Once that is
+            # resolved, explicitly exclude pint arrays.
+            # # pint doesn't implement `np.full_like` in a way that's
+            # # currently compatible.
+            # # https://github.com/pydata/xarray/pull/4746#issuecomment-753425173
+            # # or any(
+            # #     isinstance(v.data, pint_array_type) for v in self.variables.values()
+            # # )
+            or any(not isinstance(v.data, np.ndarray) for v in nonindexes)
+        )

+        for dim in dims:
+            if needs_full_reindex:
                result = result._unstack_full_reindex(dim, fill_value, sparse)
            else:
                result = result._unstack_once(dim, fill_value)
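
For context on the comment at the top of the new block, a standalone illustration (not part of the PR) of why accessing .data on an index variable is costly: a stacked MultiIndex coordinate is materialized as an object-dtype ndarray of tuples, which is exactly the allocation the nonindexes filter avoids.

    import numpy as np
    import xarray as xr

    ds = xr.Dataset(
        {"value": (("x", "y"), np.arange(6).reshape(2, 3))},
        coords={"x": ["a", "b"], "y": [0, 1, 2]},
    ).stack(z=("x", "y"))

    # Accessing .data on the MultiIndex coordinate builds an
    # object-dtype ndarray of tuples:
    print(ds["z"].data.dtype)  # object

    # The PR's filter: only variables not backed by an index get
    # their .data inspected.
    nonindexes = [ds.variables[k] for k in set(ds.variables) - set(ds.indexes)]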