Description
What is your issue?
I know we've done lots of great work on indexes. But I worry that we've made some basic things confusing. (Or at least I'm confused, very possibly I'm being slow this morning?)
A colleague asked how to group by multiple coords. I told them to set_index(foo=coord_list).groupby("foo"). But that seems to work inconsistently.
Here it works great:
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="d",
    coords=dict(
        labels1=("d", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("d", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)
da
<xarray.DataArray (d: 6)> Size: 48B
array([ 1., 2., 3., 0., 2., nan])
Coordinates:
labels1 (d) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
labels2 (d) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'
Dimensions without coordinates: d
Then make the two coords into a multiindex along d and group by d — we successfully get a value for each of the three values on d:
da.set_index(d=['labels1', 'labels2']).groupby('d').mean()
<xarray.DataArray (d: 3)> Size: 24B
array([1. , 2. , 1.5])
Coordinates:
* d (d) object 24B ('a', 'x') ('b', 'y') ('c', 'z')
But here it doesn't work as expected:
da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="time",
    coords=dict(
        time=("time", pd.date_range("2001-01-01", freq="ME", periods=6)),
        labels1=("time", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("time", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)
>>> da
<xarray.DataArray (time: 6)> Size: 48B
array([ 1., 2., 3., 0., 2., nan])
Coordinates:
* time (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
labels1 (time) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
labels2 (time) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'
reindexed = da.set_index(combined=['labels1', 'labels2'])
reindexed
<xarray.DataArray (time: 6)> Size: 48B
array([ 1., 2., 3., 0., 2., nan])
Coordinates:
* time (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
* combined (combined) object 48B MultiIndex
* labels1 (combined) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
* labels2 (combined) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'
Then we try grouping by combined, and we get a value for every value of combined and time?
reindexed.groupby('combined').mean()
<xarray.DataArray (time: 6, combined: 3)> Size: 144B
array([[ 1., 1., 1.],
[ 2., 2., 2.],
[ 3., 3., 3.],
[ 0., 0., 0.],
[ 2., 2., 2.],
[nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
* combined (combined) object 24B ('a', 'x') ('b', 'y') ('c', 'z')
I'm guessing the reason is that
combined (combined) object 48B MultiIndex
is treated as a dimension?
- But it's not a dimension: the array lists
  <xarray.DataArray (time: 6)> Size: 48B
  as the dimensions. What's a good mental model for this?
- I might have expected reindexed.groupby('combined').mean(...) or reindexed.groupby('combined').mean('time') to reduce over the time dimension. But that gets even more confusing: we then don't reduce over the groups of combined!
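For comparison, here is a minimal sketch of one way to get a mean per (labels1, labels2) pair without going through set_index at all. It is only a possible workaround, not a claim about how set_index is meant to behave. The pair coordinate and its "a_x"-style string labels are hypothetical names made up for this sketch, and the time values use month-start dates purely so the snippet also runs on older pandas versions.

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="time",
    coords=dict(
        # Month-start dates, chosen only so the sketch runs on older pandas too.
        time=("time", pd.date_range("2001-01-01", freq="MS", periods=6)),
        labels1=("time", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("time", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)

# Hypothetical helper coordinate: one string label per (labels1, labels2) pair.
pair = np.char.add(np.char.add(da["labels1"].values, "_"), da["labels2"].values)
labelled = da.assign_coords(pair=("time", pair))

# Grouping by a plain (non-index) coordinate along `time` reduces over `time`
# within each group; NaN values are skipped by default for float data.
result = labelled.groupby("pair").mean()
print(result.values)
```

This should give one value per pair along a new pair dimension, with the same three means as the first example (1, 2 and 1.5), at the cost of flattening the two labels into a single string.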
To the extent this sort of point is correct and it's not just me misunderstanding: I know we've done some really great work recently on expanding xarray forward. But one of the main reasons I fell in love with xarray 10 years ago was how explicit the API was relative to pandas. And I worry that this sort of thing sets us back a bit.