
Faster unstacking #4746

Merged
merged 16 commits into pydata:master from fast-unstack
Jan 24, 2021

Conversation

max-sixty
Collaborator

@max-sixty max-sixty commented Dec 31, 2020

asv profile unstacking.Unstacking.time_unstack_slow runs in 10ms now, down from 1076ms.

Todos:

  • Resolve how to handle sparse arrays — they currently go through the existing functions, and we have one test failure
  • Decide what to do with the existing functions — in particular Variable.unstack wouldn't work with the new code, since it requires the index which is being unstacked, and variables don't have indexes. The existing function will fail on any missing indexes, but may still be useful.
  • Clean up the code a bit — remove _fast from names where possible / remove some dead comments
  • Passes isort . && black . && mypy . && flake8
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@max-sixty
Collaborator Author

Any ideas on how sparse arrays should be handled in unstack? Currently we use reindex, so this seems to pass through without much effort on our part.

In the new code, we're creating an array with np.full and then assigning to the appropriate locations. Can we do something similar that's not dependent on the underlying numpy / sparse / dask storage?
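For reference, here is a minimal numpy-only sketch of the allocate-and-assign approach described above (the shapes and index arrays are illustrative, not xarray's actual code):

```python
import numpy as np

# Unstack 6 flat values into a 2x3 grid: allocate a full array of the
# fill value, then assign each value to its (row, col) position.
values = np.arange(6.0)
rows = np.array([0, 0, 0, 1, 1, 1])   # illustrative unstacked coordinates
cols = np.array([0, 1, 2, 0, 1, 2])

out = np.full((2, 3), np.nan)
out[rows, cols] = values
```

This is fast because it is a single vectorized assignment, but it assumes the underlying array supports in-place indexed assignment, which is exactly the question for sparse and dask backends.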

@shoyer
Member

shoyer commented Jan 1, 2021

Very nice!

I think we can solve the issue with sparse by using np.full_like, which dispatches on NumPy 1.17+ via the NEP-18 __array_function__ protocol (which I'm pretty sure sparse supports).

The bigger challenge I'm concerned about here is dask arrays, which don't support array assignment like this at all (dask/dask#2000). We will probably need to keep the slower option around for dask, at least for now.
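To illustrate the NEP-18 dispatch mentioned here, a toy duck array (purely illustrative; sparse's real implementation is more involved) shows how np.full_like gets routed to the wrapped type instead of producing a plain ndarray:

```python
import numpy as np

class DuckArray:
    """Toy NEP-18 duck array: np.full_like dispatches to
    __array_function__ rather than coercing to a plain ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.full_like:
            a, fill_value = args[0], args[1]
            # delegate to numpy on the inner array, re-wrap the result
            return DuckArray(np.full_like(a.data, fill_value, **kwargs))
        return NotImplemented

filled = np.full_like(DuckArray([1.0, 2.0]), np.nan)
```

Because the dispatch happens inside numpy, the unstack code can call np.full_like without knowing which backend it is holding.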

@max-sixty
Collaborator Author

Great, thanks, I'm making that change.

Is there any need to keep the sparse kwarg? My inclination is to remove it and retain types — so to get a sparse array back, convert to sparse before unstacking?

@max-sixty
Collaborator Author

Would anyone familiar with pint be able to help with this error? https://dev.azure.com/xarray/xarray/_build/results?buildId=4650&view=ms.vss-test-web.build-test-results-tab&runId=73654&resultId=111454&paneView=debug

It seems to be failing on the assignment: data[(..., *indexer)] = reordered, rather than anything specific to unstacking.

Here's the stack trace from there:

/home/vsts/work/1/s/xarray/core/variable.py:1627: in _unstack_once
    data[(..., *indexer)] = reordered
/home/vsts/work/1/s/xarray/core/common.py:131: in __array__
    return np.asarray(self.values, dtype=dtype)
/home/vsts/work/1/s/xarray/core/variable.py:543: in values
    return _as_array_or_item(self._data)
/home/vsts/work/1/s/xarray/core/variable.py:275: in _as_array_or_item
    data = np.asarray(data)
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/numpy/core/_asarray.py:83: in asarray
    return array(a, dtype, copy=False, order=order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Quantity([ 0  0  0  0  0  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4
  4  5  5  5  5  5  6  6  6  6  6  7  7  7  7  7  8  8  8  8  8  9  9  9
  9 10], 'meter')>
t = None

    def __array__(self, t=None):
>       warnings.warn(
            "The unit of the quantity is stripped when downcasting to ndarray.",
            UnitStrippedWarning,
            stacklevel=2,
        )
E       pint.errors.UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.

/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/pint/quantity.py:1683: UnitStrippedWarning

Worst case, I can direct pint arrays to the existing unstack path, but ideally this would work.

@shoyer
Member

shoyer commented Jan 2, 2021

@keewis any thoughts on the pint issue?

@jthielen
Contributor

jthielen commented Jan 2, 2021

@max-sixty That error seems to be arising because data is a numpy.ndarray and reordered is an xarray.Variable with a pint.Quantity inside, so trying to assign a unit-aware array to an ndarray causes the units to be stripped.

Perhaps relevant to the discussion is this pint issue, hgrecco/pint#882, where it was tentatively decided that pint's wrapped implementation of np.full_like would base its units off of fill_value alone. As it stands, I think _unstack_once would need to create a Quantity-wrapped nan in the units of self.data for pint to produce a Quantity filled with nans (if I'm interpreting the implementation here correctly and that is what is needed). This seems like something where units-on-the-dtype would be useful, but alas, things aren't there yet!
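A stdlib+numpy stand-in for the behavior described here (FakeQuantity is hypothetical and only mimics the units-come-from-fill_value decision in hgrecco/pint#882; it is not pint's API): a bare scalar fill strips units, while a Quantity-wrapped nan keeps them.

```python
import numpy as np

class FakeQuantity:
    """Hypothetical stand-in for pint.Quantity, used only to illustrate
    full_like taking its units from fill_value alone."""
    def __init__(self, magnitude, units=None):
        self.magnitude = np.asarray(magnitude, dtype=float)
        self.units = units

    def __array_function__(self, func, types, args, kwargs):
        if func is np.full_like:
            a, fill = args[0], args[1]
            if isinstance(fill, FakeQuantity):
                # units come from fill_value alone, as per pint#882
                return FakeQuantity(np.full_like(a.magnitude, fill.magnitude),
                                    fill.units)
            # a plain scalar fill strips the units, as in the report above
            return np.full_like(a.magnitude, fill)
        return NotImplemented

data = FakeQuantity([0.0, 1.0, 2.0], units="meter")
stripped = np.full_like(data, np.nan)                      # plain ndarray
kept = np.full_like(data, FakeQuantity(np.nan, "meter"))   # unit-aware
```

Under this behavior, _unstack_once would need to wrap its nan fill value in the source's units before calling np.full_like.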

@max-sixty
Collaborator Author

Thanks for responding @jthielen .

Yes, so full_like isn't creating a pint array:

(Pdb) self.data
<Quantity([ 0.          0.20408163  0.40816327  0.6122449   0.81632653  1.02040816
  1.2244898   1.42857143  1.63265306  1.83673469  2.04081633  2.24489796
  2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  5.51020408  5.71428571  5.91836735
  6.12244898  6.32653061  6.53061224  6.73469388  6.93877551  7.14285714
  7.34693878  7.55102041  7.75510204  7.95918367  8.16326531  8.36734694
  8.57142857  8.7755102   8.97959184  9.18367347  9.3877551   9.59183673
  9.79591837 10.        ], 'meter')>
(Pdb) np.full_like(self.data, fill_value=fill_value)
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
(Pdb) fill_value
nan

I do think this is a bit surprising — while fill_value isn't typed, it's compatible with the existing type.

For the moment, I'll direct pint arrays to take the existing code path — I'm more confident that we don't want to special-case pint in the unstack routine.

@max-sixty
Collaborator Author

I imagine I'm making some basic error here, but what's the best way to check whether an array is a pint array?

isinstance(self.data, unit_registry.Quantity) returns False, though that seems to be what we do in test_units.py?

(Pdb) unit_registry
<pint.registry.UnitRegistry object at 0x13d4574c0>
(Pdb) unit_registry.Quantity
<class 'pint.quantity.build_quantity_class.<locals>.Quantity'>
(Pdb) isinstance(self.data, unit_registry.Quantity)
False
(Pdb) self.data
<Quantity([ 0.          0.20408163  0.40816327  0.6122449   0.81632653  1.02040816
  1.2244898   1.42857143  1.63265306  1.83673469  2.04081633  2.24489796
  2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  5.51020408  5.71428571  5.91836735
  6.12244898  6.32653061  6.53061224  6.73469388  6.93877551  7.14285714
  7.34693878  7.55102041  7.75510204  7.95918367  8.16326531  8.36734694
  8.57142857  8.7755102   8.97959184  9.18367347  9.3877551   9.59183673
  9.79591837 10.        ], 'meter')>
(Pdb) self.data.__class__
<class 'pint.quantity.build_quantity_class.<locals>.Quantity'>

@jthielen
Contributor

jthielen commented Jan 2, 2021

I think it is best to check against the global pint.Quantity (isinstance(self.data, pint.Quantity)). This should work to check for a Quantity from any unit registry (rather than the other methods which are registry-specific since a class builder is used).
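A stdlib-only sketch of why the registry-specific check fails (the class and function names below are illustrative of pint's class-builder pattern, not its real internals):

```python
class Quantity:
    """Stands in for the global pint.Quantity base class."""

def build_quantity_class():
    """Stands in for pint's per-registry class builder: every
    UnitRegistry gets its own Quantity subclass, and all of those
    subclasses share the global base class."""
    class RegistryQuantity(Quantity):
        pass
    return RegistryQuantity

registry_a_Quantity = build_quantity_class()
registry_b_Quantity = build_quantity_class()

q = registry_a_Quantity()
```

Checking against another registry's class fails, while checking against the shared base succeeds, which is why `isinstance(self.data, pint.Quantity)` is the robust test.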

@max-sixty
Collaborator Author

Thank you @jthielen !

@max-sixty
Collaborator Author

I'm not sure whether I'm making some very basic mistake, but I'm seeing what seems like a very surprising error.

After the most recent commit, which seems to do very little apart from importing pint iff it's available (b33aded), I'm getting a lot of pint errors, unrelated to unstack / stack.

Here's the results from the run prior:
https://dev.azure.com/xarray/xarray/_build/results?buildId=4650&view=ms.vss-test-web.build-test-results-tab

And from this run:
https://dev.azure.com/xarray/xarray/_build/results?buildId=4651&view=ms.vss-test-web.build-test-results-tab

Any ideas what's happening? As ever, there's a 30% chance I made an obvious typo that I can't see... Thanks in advance.

@max-sixty
Collaborator Author

Until #4751 is resolved, I've taken out the explicit pint check and replaced it with a numpy check.

The code is a bit messy now — there are now two levels of comments. But I've put references in, so it should be tractable. Lmk any feedback.

Otherwise this is ready to go from my end.
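The numpy check described above might look roughly like this sketch (the function name is hypothetical, not xarray's actual code):

```python
import numpy as np

def can_use_fast_unstack(data):
    # Hypothetical gate: only plain numpy arrays take the fast
    # allocate-and-assign path; everything else (dask, sparse, pint)
    # falls back to the existing reindex-based unstack.
    return isinstance(data, np.ndarray)
```

The upside of gating on `np.ndarray` rather than on specific wrapped types is that unknown duck arrays safely get the slow-but-general path by default.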

@max-sixty
Collaborator Author

This still seems to be getting a bunch of pint failures, like this one: https://dev.azure.com/xarray/xarray/_build/results?buildId=4657&view=ms.vss-test-web.build-test-results-tab&runId=73796&resultId=110407&paneView=debug

I'm confused, since this PR now has no mention of pint and I don't see any mention of unstacking in those test failures. I suspect I'm missing something. Any ideas for what could be happening?

@max-sixty
Collaborator Author

Tests now pass after merging master, not sure whether the previous tests were flaky vs something upstream changed...

Ready for a final review

@max-sixty
Collaborator Author

As discussed in the dev meeting, dask/dask#7033 would allow dask to use the fast path, and would likely eventually allow our existing path to be removed.

@mrocklin
Contributor

mrocklin commented Jan 6, 2021

If anyone here has time to review dask/dask#7033 that would be greatly appreciated :)

@max-sixty
Collaborator Author

Any final comments before merging?

@max-sixty max-sixty changed the title 100x faster unstacking Faster unstacking Jan 14, 2021
@max-sixty
Collaborator Author

I double-checked the benchmarks and added a pandas comparison. That involved ensuring the missing values were handled correctly in them, and ensuring the setup wasn't included in the benchmark.

I don't get the 100x speed-up that I thought I saw initially; it's now more like 8x. Still decent! I'm not sure whether that's because I misread the benchmark previously or because the benchmarks are slightly different — I guess the first.

Pasting below the results so we have something concrete.

Existing

asv profile unstacking.Unstacking.time_unstack_slow master | head -n 20
··· unstacking.Unstacking.time_unstack_slow                   861±20ms

Proposed

asv profile unstacking.Unstacking.time_unstack_slow HEAD | head -n 20
··· unstacking.Unstacking.time_unstack_slow                    108±3ms

Pandas

asv profile unstacking.Unstacking.time_unstack_pandas_slow master | head -n 20
··· unstacking.Unstacking.time_unstack_pandas_slow            207±10ms

Are we OK with the claim vs pandas? I think it's important that we make accurate comparisons (both good and bad), but I'm open-minded if this seems a bit aggressive. It's worth someone reviewing the code in the benchmark to ensure I haven't made a mistake.
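For anyone wanting to eyeball the pandas side, a rough, simplified equivalent of what such a comparison measures (this is not the asv benchmark code itself):

```python
import numpy as np
import pandas as pd

# Unstack the inner level of a two-level MultiIndex into columns,
# the pandas operation the comparison above benchmarks at scale.
idx = pd.MultiIndex.from_product([range(2), ["a", "b", "c"]])
s = pd.Series(np.arange(6.0), index=idx)
wide = s.unstack()   # 2 rows x 3 columns
```

The asv suite does the same kind of operation on much larger indexes, which is where the timing differences above come from.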

@max-sixty
Collaborator Author

Would anyone know whether the docs failure is related to this PR? I can't see anything in the log apart from matplotlib warnings? https://readthedocs.org/projects/xray/builds/12768566/

@keewis
Collaborator

keewis commented Jan 14, 2021

it definitely is not: look at the builds of #4809: one build failed and the next (about 12 minutes later) fails. I guess either sphinx, nbsphinx or one of their dependencies got updated on conda-forge. nbformat (5.0.8 → 5.1.0), maybe?

Edit: nbformat=5.1.2 is out which should fix that issue.

@max-sixty
Collaborator Author

Any final feedback before merging?

@max-sixty
Collaborator Author

Let me know any post-merge feedback and I'll make the changes

@max-sixty max-sixty merged commit a0c71c1 into pydata:master Jan 24, 2021
@max-sixty max-sixty deleted the fast-unstack branch January 24, 2021 23:48
dcherian added a commit to TomNicholas/xarray that referenced this pull request Jan 29, 2021
* master:
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 29, 2021
* upstream/master:
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
dcherian added a commit to dcherian/xarray that referenced this pull request Feb 3, 2021
* master: (458 commits)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
  fix decode for scale/ offset list (pydata#4802)
  Expand user dir paths (~) in open_mfdataset and to_zarr. (pydata#4795)
  add a version info step to the upstream-dev CI (pydata#4815)
  fix the ci trigger action (pydata#4805)
  scatter plot by order of the first appearance of hue (pydata#4723)
  ...
dcherian added a commit to DWesl/xarray that referenced this pull request Feb 11, 2021
…_and_bounds_as_coords

* upstream/master: (51 commits)
  Ensure maximum accuracy when encoding and decoding cftime.datetime values (pydata#4758)
  Fix `bounds_error=True` ignored with 1D interpolation (pydata#4855)
  add a drop_conflicts strategy for merging attrs (pydata#4827)
  update pre-commit hooks (mypy) (pydata#4883)
  ensure warnings cannot become errors in assert_ (pydata#4864)
  update pre-commit hooks (pydata#4874)
  small fixes for the docstrings of swap_dims and integrate (pydata#4867)
  Modify _encode_datetime_with_cftime for compatibility with cftime > 1.4.0 (pydata#4871)
  vélin (pydata#4872)
  don't skip the doctests CI (pydata#4869)
  fix da.pad example for numpy 1.20 (pydata#4865)
  temporarily pin dask (pydata#4873)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  ...
@max-sixty max-sixty mentioned this pull request Jul 5, 2021