DataArray.set_index can add new dimensions that are absent on the underlying array. #9278

@max-sixty

Description

What is your issue?

I know we've done lots of great work on indexes. But I worry that we've made some basic things confusing. (Or at least I'm confused; very possibly I'm just being slow this morning.)

A colleague asked how to group by multiple coords. I told them to use set_index(foo=coord_list).groupby("foo"). But that seems to work inconsistently.


Here it works great:

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="d",
    coords=dict(
        labels1=("d", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("d", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)
da

<xarray.DataArray (d: 6)> Size: 48B
array([ 1.,  2.,  3.,  0.,  2., nan])
Coordinates:
    labels1  (d) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
    labels2  (d) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'
Dimensions without coordinates: d

Then we make the two coords into a MultiIndex along d and group by d, and we successfully get one value for each of the three label pairs on d:

da.set_index(d=['labels1', 'labels2']).groupby('d').mean()

<xarray.DataArray (d: 3)> Size: 24B
array([1. , 2. , 1.5])
Coordinates:
  * d        (d) object 24B ('a', 'x') ('b', 'y') ('c', 'z')

But here it doesn't work as expected:

import pandas as pd

da = xr.DataArray(
    np.array([1, 2, 3, 0, 2, np.nan]),
    dims="time",
    coords=dict(
        time=("time", pd.date_range("2001-01-01", freq="ME", periods=6)),
        labels1=("time", np.array(["a", "b", "c", "c", "b", "a"])),
        labels2=("time", np.array(["x", "y", "z", "z", "y", "x"])),
    ),
)
da

<xarray.DataArray (time: 6)> Size: 48B
array([ 1.,  2.,  3.,  0.,  2., nan])
Coordinates:
  * time     (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
    labels1  (time) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
    labels2  (time) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'

reindexed = da.set_index(combined=['labels1', 'labels2'])
reindexed

<xarray.DataArray (time: 6)> Size: 48B
array([ 1.,  2.,  3.,  0.,  2., nan])
Coordinates:
  * time      (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
  * combined  (combined) object 48B MultiIndex
  * labels1   (combined) <U1 24B 'a' 'b' 'c' 'c' 'b' 'a'
  * labels2   (combined) <U1 24B 'x' 'y' 'z' 'z' 'y' 'x'

Then we group by combined, and we get a value for every combination of combined and time?

reindexed.groupby('combined').mean()

<xarray.DataArray (time: 6, combined: 3)> Size: 144B
array([[ 1.,  1.,  1.],
       [ 2.,  2.,  2.],
       [ 3.,  3.,  3.],
       [ 0.,  0.,  0.],
       [ 2.,  2.,  2.],
       [nan, nan, nan]])
Coordinates:
  * time      (time) datetime64[ns] 48B 2001-01-31 2001-02-28 ... 2001-06-30
  * combined  (combined) object 24B ('a', 'x') ('b', 'y') ('c', 'z')

I'm guessing the reason is that the combined (combined) object 48B MultiIndex coordinate is being treated as a dimension?

  • But it's not a dimension: the array's repr lists <xarray.DataArray (time: 6)> Size: 48B as its only dimension. What's a good mental model for this?
  • I might have expected reindexed.groupby('combined').mean(...) or reindexed.groupby('combined').mean('time') to reduce over the time dimension. But that gets even more confusing: we then don't reduce over the groups of combined!

To the extent this point is correct and it's not just me misunderstanding: I know we've done some really great work recently on moving xarray forward. But one of the main reasons I fell in love with xarray 10 years ago was how explicit its API was relative to pandas, and I worry that this sort of thing sets us back a bit.
