Error concatenating Multiindex variables #3659

Open
hazbottles opened this issue Jan 1, 2020 · 1 comment
Labels
topic-combine combine/concat/merge

Comments

@hazbottles
Contributor

MCVE Code Sample

>>> import xarray as xr
>>> da = xr.DataArray([0, 1], dims=["location"], coords={"lat": ("location", [10, 11]), "lon": ("location", [20, 21])}).set_index(location=["lat", "lon"])
>>> da2 = xr.DataArray([2, 3], dims=["location"], coords={"lat": ("location", [12, 13]), "lon": ("location", [22, 23])}).set_index(location=["lat", "lon"])
>>> xr.concat([da["location"], da2["location"]], dim="location")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harry/code/xarray/xarray/core/concat.py", line 135, in concat
    return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
  File "/home/harry/code/xarray/xarray/core/concat.py", line 431, in _dataarray_concat
    ds = _dataset_concat(
  File "/home/harry/code/xarray/xarray/core/concat.py", line 384, in _dataset_concat
    result = Dataset(result_vars, attrs=result_attrs)
  File "/home/harry/code/xarray/xarray/core/dataset.py", line 541, in __init__
    variables, coord_names, dims, indexes = merge_data_and_coords(
  File "/home/harry/code/xarray/xarray/core/merge.py", line 466, in merge_data_and_coords
    return merge_core(
  File "/home/harry/code/xarray/xarray/core/merge.py", line 556, in merge_core
    assert_unique_multiindex_level_names(variables)
  File "/home/harry/code/xarray/xarray/core/variable.py", line 2363, in assert_unique_multiindex_level_names
    raise ValueError("conflicting MultiIndex level name(s):\n%s" % conflict_str)
ValueError: conflicting MultiIndex level name(s):
'lat' (location), 'lat' (<this-array>)
'lon' (location), 'lon' (<this-array>)

Expected Output

The output should be the same as first concatenating the DataArrays, then extracting the dimension location:

>>> xr.concat([da, da2], dim="location")["location"]
<xarray.DataArray 'location' (location: 4)>
array([(10, 20), (11, 21), (12, 22), (13, 23)], dtype=object)
Coordinates:
  * location  (location) MultiIndex
  - lat       (location) int64 10 11 12 13
  - lon       (location) int64 20 21 22 23

Problem Description

>>> # da["location"] looks like a normal DataArray
>>> location = da["location"]
>>> location
<xarray.DataArray 'location' (location: 2)>
array([(10, 20), (11, 21)], dtype=object)
Coordinates:
  * location  (location) MultiIndex
  - lat       (location) int64 10 11
  - lon       (location) int64 20 21
>>> # but in actual fact, the variable._data is a MultiIndex
>>> location.variable._data
PandasIndexAdapter(array=MultiIndex([(10, 20),
            (11, 21)],
           names=['lat', 'lon']), dtype=dtype('O'))

This is why the error is thrown: variable.assert_unique_multiindex_level_names is passed two variables, location.variable (the DataArray's data values) and location["location"].variable (the coordinate values). Both are MultiIndexes with the level names lat and lon, so each level name is reported as conflicting.
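
To make the failure mode concrete, here is a minimal plain-Python sketch of a level-name uniqueness check. It is not xarray's actual code: `find_level_name_conflicts` and the dict-of-tuples representation are hypothetical stand-ins, assumed only for illustration.

```python
from collections import defaultdict


def find_level_name_conflicts(variables):
    """Hypothetical stand-in for a MultiIndex level-name uniqueness check.

    `variables` maps a variable name to the tuple of MultiIndex level names
    its data carries, or None when the data is not a MultiIndex.
    """
    owners = defaultdict(list)
    for var_name, level_names in variables.items():
        for level in level_names or ():
            owners[level].append(var_name)
    # A level name claimed by more than one variable counts as a conflict.
    return {level: names for level, names in owners.items() if len(names) > 1}


# The data values and the coordinate of da["location"] carry the same levels.
variables = {
    "location": ("lat", "lon"),
    "<this-array>": ("lat", "lon"),
}
conflicts = find_level_name_conflicts(variables)
print(conflicts)
```

Under these assumptions both `lat` and `lon` are reported, each owned by `location` and `<this-array>`, which mirrors the pairs listed in the ValueError above.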

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: b3d3b44
python: 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:11:38) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.4.0-18362-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.14.1+36.gb3d3b44
pandas: 0.25.3
numpy: 1.18.0
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 42.0.2.post20191201
pip: 19.3.1
conda: None
pytest: 5.3.2
IPython: None
sphinx: None

@hazbottles
Contributor Author

hazbottles commented Jan 1, 2020

The solution that makes sense to me is:

MultiIndex level name conflicts should only be checked for coordinates, not data variables.

But I've only spent a few hours digging through the codebase to try and understand this problem - I'm not quite sure what the implications would be.
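
As a rough illustration of the idea (plain Python, not a patch against xarray), the check could be restricted to names that are coordinates. The function `find_coord_level_conflicts`, its `coord_names` parameter, and the example coordinate `site` are all hypothetical and assumed only for this sketch.

```python
from collections import defaultdict


def find_coord_level_conflicts(variables, coord_names):
    """Hypothetical check that only inspects coordinates.

    `variables` maps a variable name to its MultiIndex level names (or None);
    `coord_names` is the set of names that are coordinates.
    """
    owners = defaultdict(list)
    for var_name, level_names in variables.items():
        if var_name not in coord_names:
            continue  # data variables are skipped entirely
        for level in level_names or ():
            owners[level].append(var_name)
    return {level: names for level, names in owners.items() if len(names) > 1}


# A data variable that happens to wrap the same MultiIndex no longer conflicts.
only_coords = find_coord_level_conflicts(
    {"location": ("lat", "lon"), "data": ("lat", "lon")},
    coord_names={"location"},
)

# Two coordinates that share a level name are still caught.
clash = find_coord_level_conflicts(
    {"location": ("lat", "lon"), "site": ("lat", "alt")},
    coord_names={"location", "site"},
)
print(only_coords, clash)
```

In this sketch the first call returns no conflicts, while the second still flags `lat`, so genuine clashes between coordinates would keep raising.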

Here is another place where it feels like it makes more sense to only check the MultiIndex level names of coords:

>>> da = xr.DataArray([0, 1], dims=["location"], coords={"lat": ("location", [10, 11]), "lon": ("location", [20, 21])}).set_index(location=["lat", "lon"])
>>> location = da["location"]

# you cannot directly make a dataset with `location` as a data variable
>>> xr.Dataset({"data": location})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harry/code/xarray/xarray/core/dataset.py", line 541, in __init__
    variables, coord_names, dims, indexes = merge_data_and_coords(
  File "/home/harry/code/xarray/xarray/core/merge.py", line 466, in merge_data_and_coords
    return merge_core(
  File "/home/harry/code/xarray/xarray/core/merge.py", line 556, in merge_core
    assert_unique_multiindex_level_names(variables)
  File "/home/harry/code/xarray/xarray/core/variable.py", line 2363, in assert_unique_multiindex_level_names
    raise ValueError("conflicting MultiIndex level name(s):\n%s" % conflict_str)
ValueError: conflicting MultiIndex level name(s):
'lat' (location), 'lat' (data)
'lon' (location), 'lon' (data)

# but if you go a round-about way, you can exploit that assign_coords only checks 
# the multiindex names of coordinates, not data variables
>>> ds = xr.Dataset({"data": xr.DataArray(data=location.variable._data, dims=["location"])})
>>> ds = ds.assign_coords({"location": location})
>>> ds
<xarray.Dataset>
Dimensions:   (location: 2)
Coordinates:
  * location  (location) MultiIndex
  - lat       (location) int64 10 11
  - lon       (location) int64 20 21
Data variables:
    data      (location) object (10, 20) (11, 21)
>>> ds["data"].variable._data
PandasIndexAdapter(array=MultiIndex([(10, 20),
            (11, 21)],
           names=['lat', 'lon']), dtype=dtype('O'))

If making variable.assert_unique_multiindex_level_names only check coords is the way to go, I'm keen + happy to try putting together a pull request for this.

@dcherian dcherian added the topic-combine combine/concat/merge label Jul 8, 2021