Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor index vs. coordinate variable(s) #5636

Merged
merged 14 commits into from
Aug 9, 2021

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Jul 26, 2021

This implements option 3 (sort of) described in #5553 (comment):

  • the goal is to avoid wrapping an xarray.Index into an xarray.Variable and keep those two concepts distinct from each other.
  • the xarray.Index.from_variables class constructor accepts a dictionary of xarray.Variable objects as argument and may (or should?) also return corresponding xarray.IndexVariable objects to ensure immutability.
  • for PandasIndex, the new returned xarray.IndexVariable wraps the underlying pd.Index via a PandasIndexingAdapter (this reverts some changes made in Flexible indexes: add Index base class and xindexes properties #5102).
  • for PandasMultiIndex, this PR adds PandasMultiIndexingAdapter so that we can wrap the pandas multi-index in separate coordinate variables objects: one for the dimension + one for each level. The level coordinates data internally hold a reference to the dimension coordinate data to avoid indexing the same underlying pd.MultiIndex for each of those coordinates (PandasMultiIndexingAdapter.__getitem__ is memoized for that purpose).

This is very much work in progress, I need to update (or revert) all related parts of Xarray's internals, update tests, etc. At this stage any comment on the approach described above is welcome.

- Pass Variable objects to xarray.Index constructor
- The index should create IndexVariable objects (`coords` attribute)
- PandasIndex: IndexVariable wraps PandasIndexingAdpater wraps pd.Index
@github-actions
Copy link
Contributor

github-actions bot commented Jul 26, 2021

Unit Test Results

         6 files  ±0           6 suites  ±0   56m 17s ⏱️ ±0s
16 210 tests ±0  14 479 ✔️ ±0  1 731 💤 ±0  0 ❌ ±0 
90 456 runs  ±0  82 284 ✔️ ±0  8 172 💤 ±0  0 ❌ ±0 

Results for commit 4bb9d9c. ± Comparison against base commit 4bb9d9c.

♻️ This comment has been updated with latest results.

@benbovy benbovy marked this pull request as draft July 27, 2021 12:14
xarray/core/indexes.py Outdated Show resolved Hide resolved
@pep8speaks
Copy link

Hello @benbovy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 10:1: F401 'xarray.core.indexes.PandasIndex' imported but unused

xarray/core/parallel.py Outdated Show resolved Hide resolved
@benbovy
Copy link
Member Author

benbovy commented Jul 30, 2021

This is now ready for review!

The new features (e.g., PandasMultiIndexingAdapter) are not yet integrated with the rest of the code, but it's probably better to do that in a follow-up PR as this one is getting pretty big.

When cleaning up things I noticed that, e.g., for a dataset ds with one dimension coordinate “x”, the corresponding variable ds.variables[‘x’] may wrap a different (pandas) index object than ds.indexes[‘x’] (even though both are equal). I don’t think this has an impact on the cache defined for multi-index coordinates in this PR (when created all these coordinates wrap the same underlying pandas index), but in general this is probably sub-optimal. Maybe we could avoid indexing the indexes twice here too?

There are also some places where the indexes are treated as arrays and that probably need better refactoring than the dirty fixes made here. This is the case for align (see #5647) and map_blocks (see #5636 (comment)).

@benbovy benbovy marked this pull request as ready for review July 30, 2021 11:02
Dataset constructor doesn't accept xarray indexes yet. Create new
coordinates from the underlying pandas indexes.
@benbovy
Copy link
Member Author

benbovy commented Aug 5, 2021

During the last meeting on flexible indexes #5452 we discussed (and agreed) about the general approach implemented here for xarray.Index. Nothing is settled yet regarding new public API or public-facing changes, though.

Thanks to everyone who attended to the meeting, and thanks @shoyer and @dcherian for the review/help here.

If there's no objection I'm going to merge this PR tomorrow so that I can move forward with the refactoring.

@mathause mathause mentioned this pull request Aug 5, 2021
4 tasks
@benbovy benbovy merged commit 4bb9d9c into pydata:main Aug 9, 2021
st-bender added a commit to st-bender/xarray that referenced this pull request Aug 10, 2021
* main: (31 commits)
  Refactor index vs. coordinate variable(s) (pydata#5636)
  pre-commit: autoupdate hook versions (pydata#5685)
  Flexible Indexes: Avoid len(index) in map_blocks (pydata#5670)
  Speed up _mapping_repr (pydata#5661)
  update the link to `scipy`'s intersphinx file (pydata#5665)
  Bump styfle/cancel-workflow-action from 0.9.0 to 0.9.1 (pydata#5663)
  pre-commit: autoupdate hook versions (pydata#5660)
  fix the binder environment (pydata#5650)
  Update api.rst (pydata#5639)
  Kwargs to rasterio open (pydata#5609)
  Bump codecov/codecov-action from 1 to 2.0.2 (pydata#5633)
  new blank whats-new for v0.19.1
  v0.19.0 release notes (pydata#5632)
  remove deprecations scheduled for 0.19 (pydata#5630)
  Make typing-extensions optional (pydata#5624)
  Plots get labels from pint arrays (pydata#5561)
  Add to_numpy() and as_numpy() methods (pydata#5568)
  pin fsspec (pydata#5627)
  pre-commit: autoupdate hook versions (pydata#5617)
  Add dataarray scatter with 3d support (pydata#4909)
  ...
dcherian added a commit to kmsquire/xarray that referenced this pull request Aug 11, 2021
* upstream/main: (31 commits)
  Refactor index vs. coordinate variable(s) (pydata#5636)
  pre-commit: autoupdate hook versions (pydata#5685)
  Flexible Indexes: Avoid len(index) in map_blocks (pydata#5670)
  Speed up _mapping_repr (pydata#5661)
  update the link to `scipy`'s intersphinx file (pydata#5665)
  Bump styfle/cancel-workflow-action from 0.9.0 to 0.9.1 (pydata#5663)
  pre-commit: autoupdate hook versions (pydata#5660)
  fix the binder environment (pydata#5650)
  Update api.rst (pydata#5639)
  Kwargs to rasterio open (pydata#5609)
  Bump codecov/codecov-action from 1 to 2.0.2 (pydata#5633)
  new blank whats-new for v0.19.1
  v0.19.0 release notes (pydata#5632)
  remove deprecations scheduled for 0.19 (pydata#5630)
  Make typing-extensions optional (pydata#5624)
  Plots get labels from pint arrays (pydata#5561)
  Add to_numpy() and as_numpy() methods (pydata#5568)
  pin fsspec (pydata#5627)
  pre-commit: autoupdate hook versions (pydata#5617)
  Add dataarray scatter with 3d support (pydata#4909)
  ...
@benbovy benbovy mentioned this pull request Aug 11, 2021
54 tasks
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 13, 2021
* upstream/main: (34 commits)
  Use same bool validator as other inputs (pydata#5703)
  conditionally disable bottleneck (pydata#5560)
  Refactor index vs. coordinate variable(s) (pydata#5636)
  pre-commit: autoupdate hook versions (pydata#5685)
  Flexible Indexes: Avoid len(index) in map_blocks (pydata#5670)
  Speed up _mapping_repr (pydata#5661)
  update the link to `scipy`'s intersphinx file (pydata#5665)
  Bump styfle/cancel-workflow-action from 0.9.0 to 0.9.1 (pydata#5663)
  pre-commit: autoupdate hook versions (pydata#5660)
  fix the binder environment (pydata#5650)
  Update api.rst (pydata#5639)
  Kwargs to rasterio open (pydata#5609)
  Bump codecov/codecov-action from 1 to 2.0.2 (pydata#5633)
  new blank whats-new for v0.19.1
  v0.19.0 release notes (pydata#5632)
  remove deprecations scheduled for 0.19 (pydata#5630)
  Make typing-extensions optional (pydata#5624)
  Plots get labels from pint arrays (pydata#5561)
  Add to_numpy() and as_numpy() methods (pydata#5568)
  pin fsspec (pydata#5627)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 13, 2021
* upstream/main: (307 commits)
  Use same bool validator as other inputs (pydata#5703)
  conditionally disable bottleneck (pydata#5560)
  Refactor index vs. coordinate variable(s) (pydata#5636)
  pre-commit: autoupdate hook versions (pydata#5685)
  Flexible Indexes: Avoid len(index) in map_blocks (pydata#5670)
  Speed up _mapping_repr (pydata#5661)
  update the link to `scipy`'s intersphinx file (pydata#5665)
  Bump styfle/cancel-workflow-action from 0.9.0 to 0.9.1 (pydata#5663)
  pre-commit: autoupdate hook versions (pydata#5660)
  fix the binder environment (pydata#5650)
  Update api.rst (pydata#5639)
  Kwargs to rasterio open (pydata#5609)
  Bump codecov/codecov-action from 1 to 2.0.2 (pydata#5633)
  new blank whats-new for v0.19.1
  v0.19.0 release notes (pydata#5632)
  remove deprecations scheduled for 0.19 (pydata#5630)
  Make typing-extensions optional (pydata#5624)
  Plots get labels from pint arrays (pydata#5561)
  Add to_numpy() and as_numpy() methods (pydata#5568)
  pin fsspec (pydata#5627)
  ...
@benbovy benbovy deleted the refactor-indexes-vs-coordinates branch August 30, 2023 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flexible indexes: how best to implement the new data model?
4 participants