
Faster unstacking #4746

Merged
merged 16 commits into pydata:master from fast-unstack
Jan 24, 2021

Conversation

max-sixty
Collaborator

@max-sixty max-sixty commented Dec 31, 2020

asv profile unstacking.Unstacking.time_unstack_slow runs in 10ms now, down from 1076ms.

Todos:

  • Resolve how to handle sparse arrays — they currently go through the existing functions, and we have one test failure
  • Decide what to do with the existing functions — in particular Variable.unstack wouldn't work with the new code, since it requires the index which is being unstacked, and variables don't have indexes. The existing function will fail on any missing indexes, but may still be useful.
  • Clean up the code a bit — remove _fast from names where possible / remove some dead comments
  • Passes isort . && black . && mypy . && flake8
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@max-sixty
Collaborator Author

Any ideas on how sparse arrays should be handled in unstack? Currently we use reindex, so this seems to pass through without much effort on our part.

In the new code, we're creating an array with np.full and then assigning to the appropriate locations. Can we do something similar that's not dependent on the underlying numpy / sparse / dask storage?
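For reference, here is a minimal numpy-only sketch of the allocate-and-assign approach described above (the shapes and index arrays are illustrative, not xarray's actual code):

```python
import numpy as np

# Unstack 6 flat values into a 2x3 grid: allocate a full array of the
# fill value, then assign each value to its (row, col) position.
values = np.arange(6.0)
rows = np.array([0, 0, 0, 1, 1, 1])   # illustrative unstacked coordinates
cols = np.array([0, 1, 2, 0, 1, 2])

out = np.full((2, 3), np.nan)
out[rows, cols] = values
```

This is fast because it is a single vectorized assignment, but it assumes the underlying array supports in-place indexed assignment, which is exactly the question for sparse and dask backends.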

@shoyer
Member

shoyer commented Jan 1, 2021

Very nice!

I think we can solve the issue with sparse by using np.full_like, which dispatches on NumPy 1.17+ via the NEP-18 __array_function__ protocol (which I'm pretty sure sparse supports).

The bigger challenge I'm concerned about here is dask arrays, which don't support array assignment like this at all (dask/dask#2000). We will probably need to keep the slower option around for dask, at least for now.
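To illustrate the NEP-18 dispatch mentioned here, a toy duck array (purely illustrative; sparse's real implementation is more involved) shows how np.full_like gets routed to the wrapped type instead of producing a plain ndarray:

```python
import numpy as np

class DuckArray:
    """Toy NEP-18 duck array: np.full_like dispatches to
    __array_function__ rather than coercing to a plain ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.full_like:
            a, fill_value = args[0], args[1]
            # delegate to numpy on the inner array, re-wrap the result
            return DuckArray(np.full_like(a.data, fill_value, **kwargs))
        return NotImplemented

filled = np.full_like(DuckArray([1.0, 2.0]), np.nan)
```

Because the dispatch happens inside numpy, the unstack code can call np.full_like without knowing which backend it is holding.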

@max-sixty
Collaborator Author

Great, thanks, I'm making that change.

Is there any need to keep the sparse kwarg? My inclination is to remove it and retain types — so to get a sparse array back, convert to sparse before unstacking?

@max-sixty
Collaborator Author

Would anyone familiar with pint be able to help with this error? https://dev.azure.com/xarray/xarray/_build/results?buildId=4650&view=ms.vss-test-web.build-test-results-tab&runId=73654&resultId=111454&paneView=debug

It seems to be failing on the assignment: data[(..., *indexer)] = reordered, rather than anything specific to unstacking.

Here's the stack trace from there:

/home/vsts/work/1/s/xarray/core/variable.py:1627: in _unstack_once
    data[(..., *indexer)] = reordered
/home/vsts/work/1/s/xarray/core/common.py:131: in __array__
    return np.asarray(self.values, dtype=dtype)
/home/vsts/work/1/s/xarray/core/variable.py:543: in values
    return _as_array_or_item(self._data)
/home/vsts/work/1/s/xarray/core/variable.py:275: in _as_array_or_item
    data = np.asarray(data)
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/numpy/core/_asarray.py:83: in asarray
    return array(a, dtype, copy=False, order=order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Quantity([ 0  0  0  0  0  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4
  4  5  5  5  5  5  6  6  6  6  6  7  7  7  7  7  8  8  8  8  8  9  9  9
  9 10], 'meter')>
t = None

    def __array__(self, t=None):
>       warnings.warn(
            "The unit of the quantity is stripped when downcasting to ndarray.",
            UnitStrippedWarning,
            stacklevel=2,
        )
E       pint.errors.UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.

/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/pint/quantity.py:1683: UnitStrippedWarning

Worst case, I can direct pint arrays to the existing unstack path, but ideally this would work.

@shoyer
Member

shoyer commented Jan 2, 2021

@keewis any thoughts on the pint issue?

@jthielen
Contributor

jthielen commented Jan 2, 2021

@max-sixty That error seems to be arising because data is a numpy.ndarray and reordered is an xarray.Variable with a pint.Quantity inside, so trying to assign a unit-aware array to an ndarray causes the units to be stripped.

Perhaps relevant to the discussion is this pint issue, hgrecco/pint#882, where it was tentatively decided that pint's wrapped implementation of np.full_like would base its units off of fill_value alone. As it stands, I think _unstack_once would need to create a Quantity-wrapped nan in the units of self.data for pint to produce a Quantity filled with nans (if I'm interpreting the implementation here correctly and that is what is needed). This seems like something where units-on-the-dtype would be useful, but alas, things aren't there yet!
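A stdlib+numpy stand-in for the behavior described here (FakeQuantity is hypothetical and only mimics the units-come-from-fill_value decision in hgrecco/pint#882; it is not pint's API): a bare scalar fill strips units, while a Quantity-wrapped nan keeps them.

```python
import numpy as np

class FakeQuantity:
    """Hypothetical stand-in for pint.Quantity, used only to illustrate
    full_like taking its units from fill_value alone."""
    def __init__(self, magnitude, units=None):
        self.magnitude = np.asarray(magnitude, dtype=float)
        self.units = units

    def __array_function__(self, func, types, args, kwargs):
        if func is np.full_like:
            a, fill = args[0], args[1]
            if isinstance(fill, FakeQuantity):
                # units come from fill_value alone, as per pint#882
                return FakeQuantity(np.full_like(a.magnitude, fill.magnitude),
                                    fill.units)
            # a plain scalar fill strips the units, as in the report above
            return np.full_like(a.magnitude, fill)
        return NotImplemented

data = FakeQuantity([0.0, 1.0, 2.0], units="meter")
stripped = np.full_like(data, np.nan)                      # plain ndarray
kept = np.full_like(data, FakeQuantity(np.nan, "meter"))   # unit-aware
```

Under this behavior, _unstack_once would need to wrap its nan fill value in the source's units before calling np.full_like.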

@max-sixty
Collaborator Author

Thanks for responding @jthielen .

Yes, so full_like isn't creating a pint array:

(Pdb) self.data
<Quantity([ 0.          0.20408163  0.40816327  0.6122449   0.81632653  1.02040816
  1.2244898   1.42857143  1.63265306  1.83673469  2.04081633  2.24489796
  2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  5.51020408  5.71428571  5.91836735
  6.12244898  6.32653061  6.53061224  6.73469388  6.93877551  7.14285714
  7.34693878  7.55102041  7.75510204  7.95918367  8.16326531  8.36734694
  8.57142857  8.7755102   8.97959184  9.18367347  9.3877551   9.59183673
  9.79591837 10.        ], 'meter')>
(Pdb) np.full_like(self.data, fill_value=fill_value)
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
(Pdb) fill_value
nan

I do think this is a bit surprising — while fill_value isn't typed, it's compatible with the existing type.

For the moment, I'll direct pint arrays to take the existing code path — I'm more confident that we don't want to special-case pint in the unstack routine.

@max-sixty
Collaborator Author

I imagine I'm making some basic error here, but what's the best way to check whether an array is a pint array?

isinstance(self.data, unit_registry.Quantity) returns False, though that seems to be what we do in test_units.py?

(Pdb) unit_registry
<pint.registry.UnitRegistry object at 0x13d4574c0>
(Pdb) unit_registry.Quantity
<class 'pint.quantity.build_quantity_class.<locals>.Quantity'>
(Pdb) isinstance(self.data, unit_registry.Quantity)
False
(Pdb) self.data
<Quantity([ 0.          0.20408163  0.40816327  0.6122449   0.81632653  1.02040816
  1.2244898   1.42857143  1.63265306  1.83673469  2.04081633  2.24489796
  2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  5.51020408  5.71428571  5.91836735
  6.12244898  6.32653061  6.53061224  6.73469388  6.93877551  7.14285714
  7.34693878  7.55102041  7.75510204  7.95918367  8.16326531  8.36734694
  8.57142857  8.7755102   8.97959184  9.18367347  9.3877551   9.59183673
  9.79591837 10.        ], 'meter')>
(Pdb) self.data.__class__
<class 'pint.quantity.build_quantity_class.<locals>.Quantity'>

@jthielen
Contributor

jthielen commented Jan 2, 2021

I think it is best to check against the global pint.Quantity (isinstance(self.data, pint.Quantity)). This should work to check for a Quantity from any unit registry (rather than the other methods which are registry-specific since a class builder is used).
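A stdlib-only sketch of why the registry-specific check fails (the class and function names below are illustrative of pint's class-builder pattern, not its real internals):

```python
class Quantity:
    """Stands in for the global pint.Quantity base class."""

def build_quantity_class():
    """Stands in for pint's per-registry class builder: every
    UnitRegistry gets its own Quantity subclass, and all of those
    subclasses share the global base class."""
    class RegistryQuantity(Quantity):
        pass
    return RegistryQuantity

registry_a_Quantity = build_quantity_class()
registry_b_Quantity = build_quantity_class()

q = registry_a_Quantity()
```

Checking against another registry's class fails, while checking against the shared base succeeds, which is why `isinstance(self.data, pint.Quantity)` is the robust test.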

@max-sixty
Collaborator Author

Thank you @jthielen !

@max-sixty
Collaborator Author

I'm not sure whether I'm making some very basic mistake, but I'm seeing what seems like a very surprising error.

After the most recent commit, which seems to do very little apart from importing pint iff it's available (b33aded), I'm getting a lot of pint errors, unrelated to unstack / stack.

Here's the results from the run prior:
https://dev.azure.com/xarray/xarray/_build/results?buildId=4650&view=ms.vss-test-web.build-test-results-tab

And from this run:
https://dev.azure.com/xarray/xarray/_build/results?buildId=4651&view=ms.vss-test-web.build-test-results-tab

Any ideas what's happening? As ever, there's a 30% chance I made an obvious typo that I can't see... Thanks in advance.

@max-sixty
Collaborator Author

Until #4751 is resolved, I've taken out the explicit pint check and replaced it with a numpy check.

The code is a bit messy now — there are now two levels of comments. But I've put references in, so it should be tractable. Lmk any feedback.

Otherwise this is ready to go from my end.
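The numpy check described above might look roughly like this sketch (the function name is hypothetical, not xarray's actual code):

```python
import numpy as np

def can_use_fast_unstack(data):
    # Hypothetical gate: only plain numpy arrays take the fast
    # allocate-and-assign path; everything else (dask, sparse, pint)
    # falls back to the existing reindex-based unstack.
    return isinstance(data, np.ndarray)
```

The upside of gating on `np.ndarray` rather than on specific wrapped types is that unknown duck arrays safely get the slow-but-general path by default.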

@max-sixty
Collaborator Author

This still seems to be getting a bunch of pint failures, like this one: https://dev.azure.com/xarray/xarray/_build/results?buildId=4657&view=ms.vss-test-web.build-test-results-tab&runId=73796&resultId=110407&paneView=debug

I'm confused, since this PR now has no mention of pint and I don't see any mention of unstacking in those test failures. I suspect I'm missing something. Any ideas for what could be happening?

@max-sixty
Collaborator Author

Tests now pass after merging master, not sure whether the previous tests were flaky vs something upstream changed...

Ready for a final review

@max-sixty
Collaborator Author

As discussed in the dev meeting, dask/dask#7033 would allow dask to use the fast path, and would likely eventually allow our existing path to be removed.

@mrocklin
Contributor

mrocklin commented Jan 6, 2021

If anyone here has time to review dask/dask#7033 that would be greatly appreciated :)

@max-sixty
Collaborator Author

Any final comments before merging?

@max-sixty max-sixty changed the title 100x faster unstacking Faster unstacking Jan 14, 2021
@max-sixty
Collaborator Author

I double-checked the benchmarks and added a pandas comparison. That involved ensuring the missing values were handled correctly in them, and ensuring the setup wasn't included in the benchmark.

I don't get the 100x speed-up that I thought I saw initially; it's now more like 8x. Still decent! I'm not sure whether that's because I misread the benchmark previously or because the benchmarks are slightly different — I guess the first.

Pasting below the results so we have something concrete.

Existing

asv profile unstacking.Unstacking.time_unstack_slow master | head -n 20
··· unstacking.Unstacking.time_unstack_slow                   861±20ms

Proposed

asv profile unstacking.Unstacking.time_unstack_slow HEAD | head -n 20
··· unstacking.Unstacking.time_unstack_slow                    108±3ms

Pandas

asv profile unstacking.Unstacking.time_unstack_pandas_slow master | head -n 20
··· unstacking.Unstacking.time_unstack_pandas_slow            207±10ms

Are we OK with the claim vs pandas? I think it's important that we make accurate comparisons (both good and bad), but I'm open-minded if this seems a bit aggressive. It's worth someone reviewing the code in the benchmark to ensure I haven't made a mistake.
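For anyone wanting to eyeball the pandas side, a rough, simplified equivalent of what such a comparison measures (this is not the asv benchmark code itself):

```python
import numpy as np
import pandas as pd

# Unstack the inner level of a two-level MultiIndex into columns,
# the pandas operation the comparison above benchmarks at scale.
idx = pd.MultiIndex.from_product([range(2), ["a", "b", "c"]])
s = pd.Series(np.arange(6.0), index=idx)
wide = s.unstack()   # 2 rows x 3 columns
```

The asv suite does the same kind of operation on much larger indexes, which is where the timing differences above come from.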

@max-sixty
Collaborator Author

Would anyone know whether the docs failure is related to this PR? I can't see anything in the log apart from matplotlib warnings? https://readthedocs.org/projects/xray/builds/12768566/

@keewis
Collaborator

keewis commented Jan 14, 2021

it definitely is not: look at the builds of #4809: one build failed and the next (about 12 minutes later) fails. I guess either sphinx, nbsphinx or one of their dependencies got updated on conda-forge. nbformat (5.0.8 → 5.1.0), maybe?

Edit: nbformat=5.1.2 is out which should fix that issue.

@max-sixty
Collaborator Author

Any final feedback before merging?

@max-sixty
Collaborator Author

Let me know any post-merge feedback and I'll make the changes

@max-sixty max-sixty merged commit a0c71c1 into pydata:master Jan 24, 2021
@max-sixty max-sixty deleted the fast-unstack branch January 24, 2021 23:48
dcherian added a commit to TomNicholas/xarray that referenced this pull request Jan 29, 2021
* master:
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 29, 2021
* upstream/master:
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
dcherian added a commit to dcherian/xarray that referenced this pull request Feb 3, 2021
* master: (458 commits)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
  fix decode for scale/ offset list (pydata#4802)
  Expand user dir paths (~) in open_mfdataset and to_zarr. (pydata#4795)
  add a version info step to the upstream-dev CI (pydata#4815)
  fix the ci trigger action (pydata#4805)
  scatter plot by order of the first appearance of hue (pydata#4723)
  ...
dcherian added a commit to DWesl/xarray that referenced this pull request Feb 11, 2021
…_and_bounds_as_coords

* upstream/master: (51 commits)
  Ensure maximum accuracy when encoding and decoding cftime.datetime values (pydata#4758)
  Fix `bounds_error=True` ignored with 1D interpolation (pydata#4855)
  add a drop_conflicts strategy for merging attrs (pydata#4827)
  update pre-commit hooks (mypy) (pydata#4883)
  ensure warnings cannot become errors in assert_ (pydata#4864)
  update pre-commit hooks (pydata#4874)
  small fixes for the docstrings of swap_dims and integrate (pydata#4867)
  Modify _encode_datetime_with_cftime for compatibility with cftime > 1.4.0 (pydata#4871)
  vélin (pydata#4872)
  don't skip the doctests CI (pydata#4869)
  fix da.pad example for numpy 1.20 (pydata#4865)
  temporarily pin dask (pydata#4873)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  ...
@max-sixty max-sixty mentioned this pull request Jul 5, 2021