Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal refactor of label-based data selection #5322

Merged
merged 13 commits into from
Jun 8, 2021

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented May 17, 2021

Xarray label-based data selection now relies on a newly added xarray.Index.query(self, labels: Dict[Hashable, Any]) -> Tuple[Any, Optional[None, Index]] method where:

  • labels is a always a dictionary with coordinate name(s) as key(s) and the corresponding selection label(s) as values
  • When calling .sel with some coordinate(s)/label(s) pairs, those are first grouped by index so that only the relevant pairs are passed to an Index.query
  • the returned tuple contains the positional indexers and (optionally) a new index object

For a simple pd.Index, labels always corresponds to a 1-item dictionary like {'coord_name': label_values}, which is not very useful in this case, but this format is useful for pd.MultiIndex and will likely be for other, custom indexes.

Moving the label->positional indexer conversion logic into PandasIndex.query(), I've tried to separate pd.Index vs pd.MultiIndex concerns by adding a new PandasMultiIndex wrapper class (it will probably be useful for other things as well) and refactor the complex logic that was implemented in convert_label_indexer. Hopefully it is a bit clearer now.

Working towards a more flexible/generic system, we still need to figure out how to:

  • pass index query extra arguments like method and tolerance for pd.Index but in a more generic way
  • handle several positional indexers over multiple dimensions possibly returned by a custom "meta-index" (e.g., staggered grid index)
  • handle the case of positional indexers returned from querying >1 indexes along the same dimension (e.g., multiple coordinates along x with a simple pd.Index)
  • pandas indexes don't need information like the names or shapes of their corresponding coordinate(s) to perform label-based selection, but this kind of information will probably be needed for other indexes (we actually need it for advanced point-wise selection using tree-based indexes in xoak).

This could be done in follow-up PRs..

Side note: I've initially tried to return from xindexes items for multi-index levels as well (not only index dimensions), but it's probably wiser to save this for later (when we'll tackle the multi-index virtual coordinate refactoring) as there are many places in Xarray where this is clearly not expected.

Happy to hear your thoughts @pydata/xarray.

add Index.query() method
split pd.Index vs. pd.MultiIndex logic
This reverts commit 261fb78.

Let's keep this for later. There are too many places in Xarray that
assume that xindexes keys are dimension names.
Add it later in the refactoring when it will be needed elsewhere (e.g., in ``set_index``).
Is this what we want?
xarray/tests/test_indexes.py Show resolved Hide resolved
xarray/core/indexes.py Show resolved Hide resolved
from .variable import Variable

if isinstance(x, (Variable, DataArray)):
x = x.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will compute lazy values. Is that what we want here? Or should we do like this?

Suggested change
x = x.values
x = x.data

Copy link
Member Author

@benbovy benbovy May 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is the use case for this, but I don't think many Xarray users have already tried providing to .sel() slice objects with elements that consist of large Variable or DataArray objects?

We could still check their dimensions (ndim == 0) before coercing it as Numpy arrays if necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only comes up if somebody uses a slice object, e.g., ds.sel(x=slice(i, j)). In this case i and j would be loaded into memory.

I think this is probably OK (it is the existing behavior) but on the other hand it would also probably be slightly safer to use .data, which would not implicitly convert a dask array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just using .data wouldn't work with the code below, which expects a numpy array. Is there an easy way to get the scalar value from a dask or any duck array? I don't think so? dask/dask#2961. But I guess we could check len(np.shape(x)) != 0 and raise before coercing to a numpy array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I guess we could check len(np.shape(x)) != 0 and raise before coercing to a numpy array?

Hmm checking this for any x wouldn't work with multi-index and slice objects with tuple elements:

idx = pd.MultiIndex.from_tuples([(0, 'a'), (1, 'b')]
da = xr.DataArray(idx)
da.sel(dim_0=slice((0, 'a'), (1, 'b')))

np.shape((0, 'a'))  # returns (2,)

@benbovy
Copy link
Member Author

benbovy commented May 28, 2021

Thanks everyone for your comments on this PR!

I'll leave this PR open for another few days in case someone has not yet had a chance to review it. Otherwise, if that's ok I plan to merge it sometime early next week so that I can move forward with the refactoring. In the next step, I'd like to try and see if we can update the data model (i.e., break the "1-dimension <-> 1-index" relationship) and get rid of multi-index virtual coordinates (use real coordinates instead) with minimal user-facing changes (or ideally none).

@benbovy benbovy merged commit 9daf9b1 into pydata:master Jun 8, 2021
@benbovy benbovy deleted the index-refactor-selection branch March 29, 2022 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants