DataArray.set_index can add new dimensions that are absent on the underlying array. #9278

@max-sixty

Description

What is your issue?

I know we've done lots of great work on indexes. But I worry that we've made some basic things confusing. (Or at least I'm confused; very possibly I'm just being slow this morning.)

A colleague asked how to group by multiple coords. I told them to use set_index(foo=coord_list).groupby("foo"). But that seems to work inconsistently.


Here it works great:

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="d",
    coords=dict(
        labels1=("d", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("d", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)
da

<xarray.DataArray (d: 6)> Size: 48B
array([ 1.,  2.,  3.,  0.,  2., nan])
Coordinates:
    labels1  (d) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
    labels2  (d) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'
Dimensions without coordinates: d

Then we make the two coords into a MultiIndex along d and group by d, and we successfully get one value for each of the three label pairs on d:

da.set_index(d=['labels1', 'labels2']).groupby('d').mean()

<xarray.DataArray (d: 3)> Size: 24B
array([1. , 2. , 1.5])
Coordinates:
  * d        (d) object 24B ('a', 'x') ('b', 'y') ('c', 'z')

But here it doesn't work as expected:

import pandas as pd

da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="time",
    coords=dict(
        time=("time", pd.date_range("2001-01-01", freq="ME", periods=6)),
        labels1=("time", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("time", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)
da

<xarray.DataArray (time: 6)> Size: 48B
array([ 1.,  2.,  3.,  0.,  2., nan])
Coordinates:
  * time     (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
    labels1  (time) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
    labels2  (time) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'

reindexed = da.set_index(combined=['labels1', 'labels2'])
reindexed

<xarray.DataArray (time: 6)> Size: 48B
array([ 1.,  2.,  3.,  0.,  2., nan])
Coordinates:
  * time      (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
  * combined  (combined) object 48B MultiIndex
  * labels1   (combined) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
  * labels2   (combined) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'

Then we group by combined, and we get a value for every combination of combined and time?

reindexed.groupby('combined').mean()

<xarray.DataArray (time: 6, combined: 3)> Size: 144B
array([[ 1.,  1.,  1.],
       [ 2.,  2.,  2.],
       [ 3.,  3.,  3.],
       [ 0.,  0.,  0.],
       [ 2.,  2.,  2.],
       [nan, nan, nan]])
Coordinates:
  * time      (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
  * combined  (combined) object 24B ('a', 'x') ('b', 'y') ('c', 'z')

I'm guessing the reason is that the combined (combined) object 48B MultiIndex coordinate is being treated as a dimension?

  • But it's not a dimension: the array's repr lists <xarray.DataArray (time: 6)> Size: 48B as its only dimension. What's a good mental model for this?
  • I might have expected reindexed.groupby('combined').mean(...) or reindexed.groupby('combined').mean('time') to reduce over the time dimension. But that gets even more confusing: we then don't reduce over the groups of combined!

To the extent this point is correct and it's not just me misunderstanding: I know we've done some really great work recently on moving xarray forward. But one of the main reasons I fell in love with xarray 10 years ago was how explicit its API was relative to pandas, and I worry that this sort of thing sets us back a bit.
