Internal refactor of label-based data selection #5322
Conversation
- add `Index.query()` method
- split `pd.Index` vs. `pd.MultiIndex` logic
This reverts commit 261fb78. Let's keep this for later. There are too many places in Xarray that assume that `xindexes` keys are dimension names. Add it later in the refactoring, when it will be needed elsewhere (e.g., in `set_index`).
Is this what we want?
```python
from .variable import Variable

...

if isinstance(x, (Variable, DataArray)):
    x = x.values
```
This will compute lazy values. Is that what we want here? Or should we do it like this?

```diff
-    x = x.values
+    x = x.data
```
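The distinction can be sketched without dask, using a stand-in lazy array (all class names below are hypothetical illustrations, not xarray or dask API): coercing with `.values` materializes a lazy array, while `.data` hands back the underlying array untouched.

```python
import numpy as np

class LazyArray:
    """Stand-in for a lazy (e.g. dask) array; .compute() materializes it."""
    def __init__(self, data):
        self._data = np.asarray(data)
        self.computed = False

    def compute(self):
        self.computed = True
        return self._data

class Wrapper:
    """Stand-in for a DataArray holding a possibly-lazy array."""
    def __init__(self, array):
        self._array = array

    @property
    def data(self):
        # hands back the underlying array unchanged (lazy stays lazy)
        return self._array

    @property
    def values(self):
        # coerces to numpy, which computes lazy values
        arr = self._array
        if hasattr(arr, "compute"):
            arr = arr.compute()
        return np.asarray(arr)

lazy = LazyArray([1, 2, 3])
w = Wrapper(lazy)
w.data                  # no compute triggered
assert not lazy.computed
w.values                # triggers compute
assert lazy.computed
```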
Not sure what the use case for this is, but I don't think many Xarray users have tried passing `.sel()` slice objects whose elements are large `Variable` or `DataArray` objects. We could still check their dimensions (`ndim == 0`) before coercing them to NumPy arrays if necessary.
This only comes up if somebody uses a slice object, e.g., `ds.sel(x=slice(i, j))`. In this case `i` and `j` would be loaded into memory. I think this is probably OK (it is the existing behavior), but on the other hand it would also probably be slightly safer to use `.data`, which would not implicitly convert a dask array.
Just using `.data` wouldn't work with the code below, which expects a numpy array. Is there an easy way to get the scalar value from a dask or any duck array? I don't think so? dask/dask#2961. But I guess we could check `len(np.shape(x)) != 0` and raise before coercing to a numpy array?
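As a sketch, that check could look like this (`coerce_scalar_label` is a hypothetical helper, not xarray API):

```python
import numpy as np

def coerce_scalar_label(x):
    """Hypothetical helper: reject non-scalar slice bounds before coercing."""
    if len(np.shape(x)) != 0:
        raise TypeError(
            f"cannot use an object of shape {np.shape(x)} as a slice bound"
        )
    # 0-d: safe to coerce to a numpy scalar
    return np.asarray(x)[()]

coerce_scalar_label(5)               # ok: a plain scalar
coerce_scalar_label(np.array(2.5))   # ok: 0-d array
# coerce_scalar_label([1, 2]) would raise TypeError
```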
> But I guess we could check `len(np.shape(x)) != 0` and raise before coercing to a numpy array?

Hmm, checking this for any `x` wouldn't work with a multi-index and slice objects with tuple elements:

```python
idx = pd.MultiIndex.from_tuples([(0, 'a'), (1, 'b')])
da = xr.DataArray(idx)
da.sel(dim_0=slice((0, 'a'), (1, 'b')))

np.shape((0, 'a'))  # returns (2,)
```
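The last line is easy to verify with numpy alone (a minimal check of the claim above):

```python
import numpy as np

# A tuple label addressing a 2-level pd.MultiIndex looks like a 1-d,
# length-2 array to numpy, so an np.shape(x) != () guard cannot tell it
# apart from a genuinely non-scalar label.
assert np.shape((0, 'a')) == (2,)
assert np.shape(5) == ()  # a plain scalar label would pass the guard
```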
Thanks everyone for your comments on this PR! I'll leave this PR open for another few days in case someone has not yet had a chance to review it. Otherwise, if that's OK, I plan to merge it sometime early next week so that I can move forward with the refactoring. In the next step, I'd like to try and see if we can update the data model (i.e., break the "1-dimension <-> 1-index" relationship) and get rid of multi-index virtual coordinates (use real coordinates instead) with minimal user-facing changes (or ideally none).
Xarray label-based data selection now relies on a newly added

```python
xarray.Index.query(self, labels: Dict[Hashable, Any]) -> Tuple[Any, Optional[Index]]
```

method where:

- `labels` is always a dictionary with coordinate name(s) as key(s) and the corresponding selection label(s) as value(s). When calling `.sel` with some coordinate(s)/label(s) pairs, those are first grouped by index so that only the relevant pairs are passed to an `Index.query` call.
- For a simple `pd.Index`, `labels` always corresponds to a 1-item dictionary like `{'coord_name': label_values}`, which is not very useful in this case, but this format is useful for `pd.MultiIndex` and will likely be for other, custom indexes.

Moving the label -> positional indexer conversion logic into `PandasIndex.query()`, I've tried to separate `pd.Index` vs. `pd.MultiIndex` concerns by adding a new `PandasMultiIndex` wrapper class (it will probably be useful for other things as well) and refactor the complex logic that was implemented in `convert_label_indexer`. Hopefully it is a bit clearer now.

Working towards a more flexible/generic system, we still need to figure out how to:

- support `method` and `tolerance` for `pd.Index`, but in a more generic way
- … (`x` with a simple `pd.Index`)

This could be done in follow-up PRs.

Side note: I initially tried to return from `xindexes` items for multi-index levels as well (not only index dimensions), but it's probably wiser to save this for later (when we'll tackle the multi-index virtual coordinate refactoring), as there are many places in Xarray where this is clearly not expected.

Happy to hear your thoughts @pydata/xarray.
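The query interface described above can be sketched in plain Python (everything below is a toy illustration with hypothetical names; the real `PandasIndex`/`PandasMultiIndex` implementations live in xarray itself):

```python
from typing import Any, Dict, Hashable, Optional, Tuple

class Index:
    """Sketch of the base interface: labels dict in, positional indexer out."""
    def query(self, labels: Dict[Hashable, Any]) -> Tuple[Any, Optional["Index"]]:
        raise NotImplementedError

class ToyPandasIndex(Index):
    """Toy single-coordinate index mapping labels to integer positions."""
    def __init__(self, coord_name: Hashable, values):
        self.coord_name = coord_name
        self.values = list(values)

    def query(self, labels):
        # for a simple index, labels is always a 1-item dict, e.g. {'x': 'b'}
        (name, label), = labels.items()
        if name != self.coord_name:
            raise KeyError(name)
        positional = self.values.index(label)
        # second tuple item: a new index resulting from the selection, or None
        return positional, None

idx = ToyPandasIndex("x", ["a", "b", "c"])
pos, new_index = idx.query({"x": "b"})
assert pos == 1 and new_index is None
```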