MultiIndex and data selection #767

Closed
@benbovy

[Edited for more clarity]

First of all, I find the MultiIndex very useful and I'm looking forward to seeing the TODOs in #719 implemented in the next releases, especially the first three in the list!

Apart from these issues, I think that some other aspects could be improved, notably regarding data selection. Or maybe I haven't correctly understood how to deal with multi-indexes and data selection...

To illustrate this, I use some fake spectral data with two discontinuous bands of different length / resolution:

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: import xarray as xr

In [3]: band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])

In [4]: wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])

In [5]: spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])

In [6]: s = pd.Series(spectrum, index=[band, wavenumber])

In [7]: s.index.names = ('band', 'wavenumber')

In [8]: da = xr.DataArray(s, dims='band_wavenumber')

In [9]: da
Out[9]:
<xarray.DataArray (band_wavenumber: 5)>
array([  1.70000000e-04,   1.40000000e-04,   1.20000000e-04,
         1.00000000e-04,   8.50000000e-05])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('foo', 4050.2) ...

I extract the band 'bar' using sel:

In [10]: da_bar = da.sel(band_wavenumber='bar')

In [11]: da_bar
Out[11]:
<xarray.DataArray (band_wavenumber: 3)>
array([  1.20000000e-04,   1.00000000e-04,   8.50000000e-05])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('bar', 4100.1) ...

It selects the data the way I want, although using the dimension name is confusing in this case. It would be nice if we could also use the MultiIndex level names as arguments of the sel method, even though I don't know whether that is easy to implement.

Furthermore, da_bar still has the 'band_wavenumber' dimension and the 'band' index level, which are not very useful anymore. Ideally, I would like to obtain a DataArray object with a 'wavenumber' dimension / coordinate and the 'bar' band name dropped from the multi-index, i.e., something that would require automatic index-level removal and/or automatic unstacking when selecting data.

Extracting the band 'bar' from the pandas Series object gives something closer to what I need (see below), but using pandas is not an option as my spectral data involves other dimensions (e.g., time, scans, iterations...) not shown here for simplicity.

In [12]: s_bar = s.loc['bar']

In [13]: s_bar
Out[13]:
wavenumber
4100.1    0.000120
4100.3    0.000100
4100.5    0.000085
dtype: float64
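
To make the desired behaviour concrete, here is a minimal pandas-only sketch of selecting by level name and dropping the selected level. It only illustrates the semantics on the one-dimensional example above and is not a proposal for how xarray should implement it. The variable names simply mirror the session above.

```python
import numpy as np
import pandas as pd

band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])

# Build the same two-level index as in the session above.
index = pd.MultiIndex.from_arrays([band, wavenumber],
                                  names=['band', 'wavenumber'])
s = pd.Series(spectrum, index=index)

# Select by level *name* instead of by position. With the default
# drop_level=True, the selected 'band' level is removed from the result.
s_bar = s.xs('bar', level='band')

print(s_bar.index.name)  # name of the single remaining level
print(len(s_bar))        # number of selected values
```

The result is indexed by 'wavenumber' only, which is the shape I would ideally get back from a selection on the DataArray as well.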

Another problem is that the unstacked DataArray object resulting from the selection has the same dimensions and size as the original DataArray object after unstacking. The only difference is that unselected values are replaced by nan.

In [14]: da.unstack('band_wavenumber')
Out[14]:
<xarray.DataArray (band: 2, wavenumber: 5)>
array([[             nan,              nan,   1.20000000e-04,
          1.00000000e-04,   8.50000000e-05],
       [  1.70000000e-04,   1.40000000e-04,              nan,
                     nan,              nan]])
Coordinates:
  * band        (band) object 'bar' 'foo'
  * wavenumber  (wavenumber) float64 4.05e+03 4.05e+03 4.1e+03 4.1e+03 4.1e+03

In [15]: da_bar.unstack('band_wavenumber')
Out[15]:
<xarray.DataArray (band: 2, wavenumber: 5)>
array([[             nan,              nan,   1.20000000e-04,
          1.00000000e-04,   8.50000000e-05],
       [             nan,              nan,              nan,
                     nan,              nan]])
Coordinates:
  * band        (band) object 'bar' 'foo'
  * wavenumber  (wavenumber) float64 4.05e+03 4.05e+03 4.1e+03 4.1e+03 4.1e+03
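
A possible explanation (an assumption, not verified against the xarray source) is that the underlying pandas MultiIndex keeps all of its original level values after a selection, so unstacking rebuilds the full 'band' and 'wavenumber' axes. The pandas-only sketch below shows that behaviour on the same data. It uses a boolean mask as a stand-in for the selection, and `MultiIndex.remove_unused_levels`, which requires pandas 0.20 or later.

```python
import numpy as np
import pandas as pd

band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])

index = pd.MultiIndex.from_arrays([band, wavenumber],
                                  names=['band', 'wavenumber'])
s = pd.Series(spectrum, index=index)

# Keep only the rows of band 'bar' without dropping any level.
mask = s.index.get_level_values('band') == 'bar'
selected = s[mask]

# The level values still list 'foo' and all five wavenumbers,
# even though only three ('bar', wavenumber) pairs remain.
stale_bands = list(selected.index.levels[0])
stale_wavenumbers = len(selected.index.levels[1])

# Trimming the levels leaves only the values that are actually used.
trimmed = selected.index.remove_unused_levels()
trimmed_bands = list(trimmed.levels[0])
trimmed_wavenumbers = len(trimmed.levels[1])

print(stale_bands, stale_wavenumbers)
print(trimmed_bands, trimmed_wavenumbers)
```

If this is indeed what happens, trimming unused levels at selection time (or before unstacking) would give an unstacked result restricted to the selected band.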
