speed up the repr for big MultiIndex objects #4846
Conversation
xarray/core/formatting.py
```python
if col_width < len(coord):
    n_values = col_width // 4
    indices = list(range(0, n_values)) + list(range(-n_values, 0))
    subset = coord[indices]
else:
    subset = coord
```
this could probably use some optimization: how big does the MultiIndex have to be before indexing + get_level_variable becomes faster than just get_level_variable?
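For reference, a hedged micro-benchmark sketch that could locate that crossover point, using plain pandas and `get_level_values` as a stand-in for xarray's `get_level_variable` (the sizes and subset length are illustrative):

```python
import timeit

import pandas as pd

for n in (10, 100, 1_000, 100_000, 1_000_000):
    idx = pd.MultiIndex.from_product([range(n), (0, 1)], names=["a", "b"])
    # materialize a level from the full index
    direct = timeit.timeit(lambda: idx.get_level_values("a"), number=100)
    # index a small head/tail subset first, then materialize the level
    truncated = timeit.timeit(
        lambda: idx[list(range(10)) + list(range(-10, 0))].get_level_values("a"),
        number=100,
    )
    print(f"n={2 * n:>9,}: direct={direct:.4f}s  truncated={truncated:.4f}s")
```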
Yeah, though it's already so fast relative to how often repr is called...
Could also defer to pandas, which seems to do this (though with a different orientation).
More than fine to leave as a TODO imo
I assumed that for everything below 100 elements, get_level_variable is faster than indexing first, which also means I don't have to worry about the case where the index doesn't have enough elements to be truncated (a minimal sketch of this guard follows this comment).
> Could also defer to pandas
how would I do that?
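A minimal sketch of the guard described above, assuming a threshold of 100 (illustrative, not necessarily the exact merged code):

```python
import pandas as pd

# stand-ins for the variables in the diff above
coord = pd.MultiIndex.from_product([range(500), range(500)])
col_width = 40

# assumed crossover: below this many elements, truncating first doesn't pay off
TRUNCATION_THRESHOLD = 100

if len(coord) > TRUNCATION_THRESHOLD and col_width < len(coord):
    n_values = col_width // 4
    indices = list(range(0, n_values)) + list(range(-n_values, 0))
    subset = coord[indices]
else:
    subset = coord
```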
> how would I do that?
I had meant: pandas has a fast repr for MultiIndex, so could we use theirs somehow, despite the different orientation? On reflection, our code is simple; I agree with your impulse.
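For context, pandas keeps its own MultiIndex repr fast by formatting only the head and tail entries; a quick way to see that:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([range(5_000), range(5_000)])
# only the first and last few tuples are formatted, with '...' in between,
# so this returns quickly even for 25 million rows
print(repr(idx))
```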
This is great, thanks! We could add an ASV benchmark if you like, but also fine to leave for another day / PR
done, I think?
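A hedged sketch of what such an ASV benchmark could look like (the class name and sizes here are assumptions, not necessarily what was committed):

```python
import pandas as pd

import xarray as xr


class ReprMultiIndex:
    def setup(self):
        index = pd.MultiIndex.from_product(
            [range(1_000), range(1_000)], names=("level_0", "level_1")
        )
        series = pd.Series(range(1_000_000), index=index)
        self.da = xr.DataArray(series)

    def time_repr(self):
        repr(self.da)
```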
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Thanks @keewis and @max-sixty |
I'm not able to check if this actually works for bigger arrays, but with `xr.DataArray(pd.Series(range(25_000_000), index=idx))` I get a significant speed-up of about 180x for `repr` (a reproduction sketch follows the checklist below).

- Passes `pre-commit run --all-files`
- User visible changes are documented in `whats-new.rst`
- New functions/methods are listed in `api.rst`
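A rough reproduction sketch; the construction of `idx` is an assumption, since the snippet above does not show it (any MultiIndex with 25 million rows should behave similarly):

```python
import timeit

import pandas as pd

import xarray as xr

# assumed index: 5,000 x 5,000 = 25,000,000 rows
idx = pd.MultiIndex.from_product([range(5_000), range(5_000)], names=["a", "b"])
da = xr.DataArray(pd.Series(range(25_000_000), index=idx))

# with the truncated repr this should return almost immediately; the ~180x
# figure above is the reported speed-up, not a guaranteed one
print(timeit.timeit(lambda: repr(da), number=1))
```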