Description
[Edited for more clarity]
First of all, I find the MultiIndex very useful and I'm looking forward to seeing the TODOs in #719 implemented in the next releases, especially the first three in the list!
Apart from these issues, I think that some other aspects may be improved, notably regarding data selection. Or maybe I've not correctly understood how to deal with multi-index and data selection...
To illustrate this, I use some fake spectral data with two discontinuous bands of different length / resolution:
In [1]: import numpy as np
   ...: import pandas as pd
In [2]: import xarray as xr
In [3]: band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
In [4]: wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
In [5]: spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])
In [6]: s = pd.Series(spectrum, index=[band, wavenumber])
In [7]: s.index.names = ('band', 'wavenumber')
In [8]: da = xr.DataArray(s, dims='band_wavenumber')
In [9]: da
Out[9]:
<xarray.DataArray (band_wavenumber: 5)>
array([ 1.70000000e-04, 1.40000000e-04, 1.20000000e-04,
1.00000000e-04, 8.50000000e-05])
Coordinates:
* band_wavenumber (band_wavenumber) object ('foo', 4050.2) ...
I extract the band 'bar' using sel:
In [10]: da_bar = da.sel(band_wavenumber='bar')
In [11]: da_bar
Out[11]:
<xarray.DataArray (band_wavenumber: 3)>
array([ 1.20000000e-04, 1.00000000e-04, 8.50000000e-05])
Coordinates:
* band_wavenumber (band_wavenumber) object ('bar', 4100.1) ...
It selects the data the way I want, although using the dimension name is confusing in this case. It would be nice if we could also use the MultiIndex level names as arguments of the sel method, even though I don't know whether that is easy to implement.
Furthermore, da_bar still has the 'band_wavenumber' dimension and the 'band' index level, although neither is very useful anymore. Ideally, I'd rather obtain a DataArray object with a 'wavenumber' dimension / coordinate and the 'bar' band name dropped from the multi-index, i.e., something that would require automatic index-level removal and/or automatic unstacking when selecting data.
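As a stopgap, one way to end up with a DataArray that only has a 'wavenumber' dimension is to go through the (band, wavenumber) grid, pick the band there, and drop the padding. The following is only a rough sketch under two assumptions: it rebuilds the grid with DataArray.from_series instead of starting from the stacked da shown above, and it assumes the spectrum contains no genuine NaN values, since dropna cannot tell those apart from padding.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same fake spectral data as in the session above.
band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])
s = pd.Series(spectrum, index=[band, wavenumber])
s.index.names = ('band', 'wavenumber')

# from_series expands the MultiIndex into a (band, wavenumber) grid and
# fills the combinations that do not exist in the data with NaN.
grid = xr.DataArray.from_series(s)

bar = (
    grid.sel(band='bar')              # select on the 'band' dimension by name
    .dropna('wavenumber')             # drop the NaN padding (assumes no real NaN)
    .reset_coords('band', drop=True)  # discard the leftover scalar 'band' coordinate
)

print(bar.dims)
print(bar['wavenumber'].values)
```

This costs a full (band, wavenumber) grid in memory, so it is probably not a good fit for bands with many points or for arrays with several extra dimensions.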
Extracting the band 'bar' from the pandas Series object gives something closer to what I need (see below), but using pandas is not an option as my spectral data involves other dimensions (e.g., time, scans, iterations...) not shown here for simplicity.
In [12]: s_bar = s.loc['bar']
In [13]: s_bar
Out[13]:
wavenumber
4100.1 0.000120
4100.3 0.000100
4100.5 0.000085
dtype: float64
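For comparison, pandas also lets you name the level explicitly through Series.xs, which is roughly the kind of call I have in mind for sel. A minimal sketch with the same data (this only shows the pandas behavior, not anything xarray provides here):

```python
import numpy as np
import pandas as pd

band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])
s = pd.Series(spectrum, index=[band, wavenumber])
s.index.names = ('band', 'wavenumber')

# Select on the 'band' level by its name. By default (drop_level=True),
# xs removes the selected level, leaving a plain 'wavenumber' index.
s_bar = s.xs('bar', level='band')

print(s_bar.index.name)
print(s_bar)
```

The selected level is named in the call, and the result is indexed by the remaining level only, which is the combination I'm missing when selecting on the DataArray.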
Another problem is that the unstacked DataArray object resulting from the selection has the same dimensions and size as the unstacked original DataArray object. The only difference is that unselected values are replaced by nan.
In [14]: da.unstack('band_wavenumber')
Out[14]:
<xarray.DataArray (band: 2, wavenumber: 5)>
array([[ nan, nan, 1.20000000e-04,
1.00000000e-04, 8.50000000e-05],
[ 1.70000000e-04, 1.40000000e-04, nan,
nan, nan]])
Coordinates:
* band (band) object 'bar' 'foo'
* wavenumber (wavenumber) float64 4.05e+03 4.05e+03 4.1e+03 4.1e+03 4.1e+03
In [15]: da_bar.unstack('band_wavenumber')
Out[15]:
<xarray.DataArray (band: 2, wavenumber: 5)>
array([[ nan, nan, 1.20000000e-04,
1.00000000e-04, 8.50000000e-05],
[ nan, nan, nan,
nan, nan]])
Coordinates:
* band (band) object 'bar' 'foo'
* wavenumber (wavenumber) float64 4.05e+03 4.05e+03 4.1e+03 4.1e+03 4.1e+03
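One likely explanation, which is an assumption on my part and not checked against the xarray source, is that the MultiIndex of the selection still stores the levels of the full index, and that unstack builds its coordinates from those stored levels. The pandas side of this can be shown on its own; the sketch below uses a boolean mask to mimic the selection and MultiIndex.remove_unused_levels to trim the index:

```python
import numpy as np
import pandas as pd

band = np.array(['foo', 'foo', 'bar', 'bar', 'bar'])
wavenumber = np.array([4050.2, 4050.3, 4100.1, 4100.3, 4100.5])
spectrum = np.array([1.7e-4, 1.4e-4, 1.2e-4, 1.0e-4, 8.5e-5])
s = pd.Series(spectrum, index=[band, wavenumber])
s.index.names = ('band', 'wavenumber')

# Keep only the 'bar' rows without dropping the 'band' level,
# which mimics what the selection on the DataArray returns.
s_bar = s[s.index.get_level_values('band') == 'bar']

# The index has 3 entries, but its stored levels still describe the full data.
stored_bands = list(s_bar.index.levels[0])
stored_wavenumbers = len(s_bar.index.levels[1])

# remove_unused_levels returns an index whose levels match the remaining entries.
trimmed = s_bar.index.remove_unused_levels()
trimmed_bands = list(trimmed.levels[0])
trimmed_wavenumbers = len(trimmed.levels[1])

print(stored_bands, stored_wavenumbers)
print(trimmed_bands, trimmed_wavenumbers)
```

If that is indeed the cause, trimming unused levels before unstacking would give a result whose size matches the selection.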