Faster unstacking #4746
Conversation
Any ideas on how sparse arrays should be handled here? In the new code, we're creating a full array and assigning the stacked data into it.
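For context, a minimal sketch of the assignment-based approach being discussed (the helper name and signature are hypothetical, not the PR's actual code): allocate a full array of fill values, then scatter the stacked data into it with integer-array indexing, as in the data[(..., *indexer)] = reordered line quoted in a traceback later in the thread.

```python
import numpy as np

# Hedged sketch of the fast unstack path under discussion; the name
# `unstack_fast` is hypothetical, not xarray's actual function.
def unstack_fast(reordered, indexer, new_shape, fill_value=np.nan):
    # Allocate the full output, pre-filled with the missing value.
    data = np.full(new_shape, fill_value)
    # Scatter the stacked values into place; `indexer` holds one
    # integer array per unstacked dimension.
    data[(..., *indexer)] = reordered
    return data

# Unstack a (1, 6) stacked array into (1, 3, 2):
out = unstack_fast(
    np.arange(6.0).reshape(1, 6),
    (np.array([0, 0, 1, 1, 2, 2]), np.array([0, 1, 0, 1, 0, 1])),
    (1, 3, 2),
)
```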
Very nice! I think we solve the issue with sparse by using … The bigger challenge that I'm concerned about here is dask arrays, which don't support array assignment like this at all (dask/dask#2000). We will probably need to keep around the slower option for dask, at least for now.
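The exact suggestion above is not preserved in this extract. As a hedged illustration of one allocation approach that respects the input's array type, np.full_like (with its shape keyword, available since NumPy 1.17) dispatches through __array_function__, so duck arrays such as sparse.COO can intercept it and return a result of their own type, unlike np.full:

```python
import numpy as np

# np.full_like dispatches via __array_function__, so a duck-array
# template can produce a like-typed result; with a plain ndarray
# template it simply allocates a dense array of the requested shape.
template = np.arange(4.0)
out = np.full_like(template, fill_value=np.nan, shape=(2, 3))
assert out.shape == (2, 3) and np.isnan(out).all()
```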
Great, thanks, I'm making that change. Is there any need to keep the …
Would anyone be familiar with this pint error? https://dev.azure.com/xarray/xarray/_build/results?buildId=4650&view=ms.vss-test-web.build-test-results-tab&runId=73654&resultId=111454&paneView=debug
It seems to be failing on the assignment. Here's the stack trace from there:
/home/vsts/work/1/s/xarray/core/variable.py:1627: in _unstack_once
data[(..., *indexer)] = reordered
/home/vsts/work/1/s/xarray/core/common.py:131: in __array__
return np.asarray(self.values, dtype=dtype)
/home/vsts/work/1/s/xarray/core/variable.py:543: in values
return _as_array_or_item(self._data)
/home/vsts/work/1/s/xarray/core/variable.py:275: in _as_array_or_item
data = np.asarray(data)
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/numpy/core/_asarray.py:83: in asarray
return array(a, dtype, copy=False, order=order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Quantity([ 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4
4 5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9
9 10], 'meter')>
t = None
def __array__(self, t=None):
> warnings.warn(
"The unit of the quantity is stripped when downcasting to ndarray.",
UnitStrippedWarning,
stacklevel=2,
)
E pint.errors.UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
/usr/share/miniconda/envs/xarray-tests/lib/python3.8/site-packages/pint/quantity.py:1683: UnitStrippedWarning
Worst case, I can direct pint arrays to the existing unstack path, but ideally this would work.
@keewis any thoughts on the pint issue?
@max-sixty That error seems to be arising because the assignment coerces the Quantity to a plain ndarray through __array__, which strips the units. Perhaps relevant to the discussion is this Pint issue, hgrecco/pint#882, where it was tentatively decided that Pint's wrapped implementation of …
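A minimal reproduction of the coercion described above, assuming pint is installed:

```python
import numpy as np
import pint

ureg = pint.UnitRegistry()
q = np.arange(3.0) * ureg.meter

out = np.zeros(3)
# NumPy coerces `q` through Quantity.__array__ here, which emits
# pint.errors.UnitStrippedWarning; the test suite escalates that
# warning into the error shown in the traceback above.
out[:] = q
```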
Thanks for responding @jthielen. Yes, so …
I do think this is a bit surprising — while … For the moment, I'll direct pint arrays to take the existing code path; I'm more confident that we don't want to special-case pint in the …
I imagine I'm making some basic error here, but what's the best approach for evaluating whether an array is a pint array?
I think it is best to check against the global application registry's Quantity class.
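A hedged sketch of such a check (the function name is hypothetical, and this may differ from what the PR ended up using): import pint lazily and test against pint.Quantity, the Quantity class of the global application registry.

```python
def is_pint_quantity(value) -> bool:
    # Import lazily so the caller does not require pint to be installed.
    try:
        import pint
    except ImportError:
        return False
    # pint.Quantity is the global application registry's Quantity class.
    return isinstance(value, pint.Quantity)
```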
Thank you @jthielen!
I'm not sure whether I'm making some very basic mistake, but I'm seeing what seems like a very surprising error. After the most recent commit, which seems to do very little apart from an import, tests started failing. Here are the results from the run prior: … And from this run: … Any ideas what's happening? As ever, there's a 30% chance I made an obvious typo that I can't see. Thanks in advance.
This reverts commit b33aded.
Until #4751 is resolved, I've taken out the explicit pint check and replaced it with a numpy check. The code is a bit messy now — with two levels of comments — but I've put references in, so it should be tractable. Let me know any feedback. Otherwise this is ready to go from my end.
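A hedged sketch of the dispatch described here (the predicate name is hypothetical): with the explicit pint check gone, the fast path is only taken for plain NumPy arrays, and every other duck array falls back to the existing unstack implementation.

```python
import numpy as np

def can_use_fast_unstack(data) -> bool:
    # Plain ndarrays support fancy-index assignment. Dask arrays do
    # not (dask/dask#2000), and pint arrays strip units when assigned
    # into an ndarray, so both take the existing, slower path.
    return isinstance(data, np.ndarray)
```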
This still seems to be getting a bunch of pint failures, like this one: https://dev.azure.com/xarray/xarray/_build/results?buildId=4657&view=ms.vss-test-web.build-test-results-tab&runId=73796&resultId=110407&paneView=debug
I'm confused, since this PR now has no mention of pint.
Tests now pass after merging master; not sure whether the previous tests were flaky or something upstream changed. Ready for a final review.
As discussed in the dev meeting, dask/dask#7033 would allow dask to use the fast path, and would likely eventually allow our existing path to be removed.
If anyone here has time to review dask/dask#7033 that would be greatly appreciated :)
Any final comments before merging?
I double-checked the benchmarks and added a pandas comparison. That involved ensuring the missing value was handled consistently between them and ensuring the setup wasn't in the benchmark. I don't get the 100x speed-up that I thought I saw initially; it's now more like 8x. Still decent! I'm not sure whether that's because I misread the benchmark previously or because the benchmarks are slightly different; I'd guess the first. Pasting below the results so we have something concrete.
Existing: …
Proposed: …
Pandas: …
Are we OK with the claim vs pandas? I think it's important that we make accurate comparisons (both good and bad), but I'm open-minded if it seems a bit aggressive. Worth someone reviewing the code in the benchmark to ensure I haven't made a mistake.
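For reference, a sketch of what such an asv benchmark could look like (sizes and names are illustrative, not the exact code in xarray's asv_bench): the missing value is introduced by dropping one element before unstacking, and all construction happens in setup so it stays out of the timed section.

```python
import numpy as np
import xarray as xr

class Unstacking:
    def setup(self):
        data = np.random.RandomState(0).randn(250, 500)
        # Stack both dimensions into one MultiIndexed dimension.
        self.da_full = xr.DataArray(data, dims=("a", "b")).stack(flat=("a", "b"))
        # Drop one element so unstacking has to fill a missing value.
        self.da_missing = self.da_full[:-1]
        self.series_missing = self.da_missing.to_pandas()

    def time_unstack_fast(self):
        self.da_full.unstack("flat")

    def time_unstack_slow(self):
        self.da_missing.unstack("flat")

    def time_unstack_pandas(self):
        self.series_missing.unstack()
```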
Would anyone know whether the docs failure is related to this PR? I can't see anything in the log apart from matplotlib warnings: https://readthedocs.org/projects/xray/builds/12768566/
Any final feedback before merging?
Let me know any post-merge feedback and I'll make the changes
* master:
  WIP: backend interface, now it uses subclassing (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
* upstream/master:
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
* master: (458 commits)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
  fix decode for scale/ offset list (pydata#4802)
  Expand user dir paths (~) in open_mfdataset and to_zarr. (pydata#4795)
  add a version info step to the upstream-dev CI (pydata#4815)
  fix the ci trigger action (pydata#4805)
  scatter plot by order of the first appearance of hue (pydata#4723)
  ...
…_and_bounds_as_coords
* upstream/master: (51 commits)
  Ensure maximum accuracy when encoding and decoding cftime.datetime values (pydata#4758)
  Fix `bounds_error=True` ignored with 1D interpolation (pydata#4855)
  add a drop_conflicts strategy for merging attrs (pydata#4827)
  update pre-commit hooks (mypy) (pydata#4883)
  ensure warnings cannot become errors in assert_ (pydata#4864)
  update pre-commit hooks (pydata#4874)
  small fixes for the docstrings of swap_dims and integrate (pydata#4867)
  Modify _encode_datetime_with_cftime for compatibility with cftime > 1.4.0 (pydata#4871)
  vélin (pydata#4872)
  don't skip the doctests CI (pydata#4869)
  fix da.pad example for numpy 1.20 (pydata#4865)
  temporarily pin dask (pydata#4873)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  ...
asv profile unstacking.Unstacking.time_unstack_slow runs in 10ms now, down from 1076ms.
Todos:
* Variable.unstack wouldn't work with the new code, since it requires the index which is being unstacked, and variables don't have indexes. The existing function will fail on any missing indexes, but may still be useful.
* Remove _fast from names where possible / remove some dead comments
* isort . && black . && mypy . && flake8
* whats-new.rst