Description
Building a dataset from a pandas DataFrame whose multi-index has categorical levels:
import pandas as pd
cat = pd.CategoricalDtype(categories=['foo', 'bar', 'baz'])
i1 = pd.Series(['foo', 'bar'], dtype=cat)
i2 = pd.Series(['bar', 'bar'], dtype=cat)
df = pd.DataFrame({'i1': i1, 'i2': i2, 'values': [1, 2]})
ds = df.set_index(['i1', 'i2']).to_xarray()
print(ds)
Expected output:
<xarray.Dataset>
Dimensions: (i1: 2, i2: 1)
Coordinates:
* i1 (i1) object 'foo' 'bar'
* i2 (i2) object 'bar'
Data variables:
values (i1, i2) int64 1 2
Actual output:
<xarray.Dataset>
Dimensions: (i1: 3, i2: 3)
Coordinates:
* i1 (i1) object 'foo' 'bar' 'baz'
* i2 (i2) object 'foo' 'bar' 'baz'
Data variables:
values (i1, i2) float64 nan 1.0 nan nan 2.0 nan nan nan nan
The actual output is not wrong in itself, but it is inconsistent with the non-categorical case (which gives the expected output above) and with the single-index case (where no NaN filling occurs).
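A possible explanation, offered here as an assumption rather than something the report confirms, is that a pandas MultiIndex built from categorical columns keeps every declared category in its levels, including unused ones such as 'baz'. The pandas-only sketch below compares the level sizes for the categorical frame with those of a copy whose index columns were cast to object; the cast is an illustrative workaround, not an official recommendation, and the names `idx_cat`, `idx_obj`, `n_cat` and `n_obj` are introduced here for illustration.

```python
import pandas as pd

cat = pd.CategoricalDtype(categories=["foo", "bar", "baz"])
df = pd.DataFrame(
    {
        "i1": pd.Series(["foo", "bar"], dtype=cat),
        "i2": pd.Series(["bar", "bar"], dtype=cat),
        "values": [1, 2],
    }
)

# Categorical columns: inspect how many labels each MultiIndex level holds.
idx_cat = df.set_index(["i1", "i2"]).index
n_cat = [len(level) for level in idx_cat.levels]

# Assumed workaround: cast the index columns to plain Python objects first,
# so the levels are built only from the values that actually occur.
df_obj = df.astype({"i1": object, "i2": object})
idx_obj = df_obj.set_index(["i1", "i2"]).index
n_obj = [len(level) for level in idx_obj.levels]

print("categorical level sizes:", n_cat)
print("object level sizes:", n_obj)
```

If the level sizes differ as assumed, calling to_xarray() on the object-typed frame should correspond to the non-categorical case described above, which gives the compact expected output.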
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.91-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200106
pip: 19.3.1
conda: None
pytest: None
IPython: None
sphinx: None