'drop_duplicates' behaves differently when using 1 vs many coordinates for an index #8499

Open
@jbweston

Description

What happened?

I am trying to drop duplicates from a DataArray based on the values of some of its coordinates,
starting from a DataArray that has coordinates but no indexes.

To accomplish this, I call 'DataArray.set_xindex' with the appropriate coordinate names,
and then call 'drop_duplicates' on the resulting DataArray, like so:
 

from xarray import DataArray
import numpy as np

test_array = DataArray(
    np.random.rand(5), 
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# output DataArray's 'sample' dimension has length 2, as expected
good = test_array.set_xindex(["x", "y"]).drop_duplicates("sample")
assert len(good) == 2

The above functions as expected; 'good' has had its duplicates dropped,
and we are left with a DataArray of length 2.

However, the following does not function as I would expect:

# All the 'y's are '-1', so we expect the same duplicates as before to be dropped,
# even if we don't include the 'y' values in the index.
bad = test_array.set_xindex("x").drop_duplicates("sample")
# But this assert fails! 'drop_duplicates' does not drop anything
assert not bad.equals(test_array)

What did you expect to happen?

I expected drop_duplicates to drop the duplicates when I was using only a single coordinate for the index.
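
In the meantime, a possible workaround is to compute the duplicate mask from the coordinate values directly and select with 'isel'. This is only a sketch of what I expected 'drop_duplicates' to do here, not a description of how xarray implements it; the names 'is_duplicate' and 'deduplicated' are made up for illustration.

```python
import numpy as np
import pandas as pd
from xarray import DataArray

test_array = DataArray(
    np.arange(5),
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# Flag every repeat of an 'x' value after its first occurrence.
is_duplicate = pd.Index(test_array["x"].values).duplicated(keep="first")

# Keep only the positions along 'sample' that hold a first occurrence.
deduplicated = test_array.isel(sample=np.flatnonzero(~is_duplicate))

assert len(deduplicated) == 2
```

This keeps the first occurrence of each 'x' value, which matches the default 'keep="first"' behaviour of 'drop_duplicates'.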

Minimal Complete Verifiable Example

from xarray import DataArray
import numpy as np

test_array = DataArray(
    range(5), 
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# output DataArray's 'sample' dimension has length 2, as expected
good = test_array.set_xindex(["x", "y"]).drop_duplicates("sample")
# And indeed there are only 2 elements left after dropping duplicates.
assert len(good) == 2

# 'y' is constant, so after dropping it and indexing on 'x' alone
# we expect the same duplicates as before to be dropped.
bad = test_array.drop_vars("y").set_xindex("x").drop_duplicates("sample")
# But this assert fails! 'drop_duplicates' does not drop anything
assert not bad.equals(test_array.drop_vars("y"))
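
A possible way to see where the two cases diverge is to inspect which index each array exposes for the 'sample' dimension. This is a diagnostic sketch only. It assumes, without confirmation from the xarray maintainers, that 'drop_duplicates' looks up the index by dimension name, so the single-coordinate index stored under 'x' would not be consulted. The variable names 'multi' and 'single' are made up for illustration.

```python
from xarray import DataArray

test_array = DataArray(
    range(5),
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

multi = test_array.set_xindex(["x", "y"])
single = test_array.drop_vars("y").set_xindex("x")

# Names under which each array stores its indexes.
print(list(multi.indexes))
print(list(single.indexes))

# The pandas index that each array reports for the 'sample' dimension,
# and whether that index contains any repeated entries.
print(multi.get_index("sample").has_duplicates)
print(single.get_index("sample").has_duplicates)
```

If the assumption holds, the last two lines should differ between the two arrays, which would account for the behaviour reported above.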

MCVE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.133.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.11.0
pandas: 2.1.0
numpy: 1.24.4
scipy: 1.11.2
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.2.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.9.1
distributed: 2023.9.1
matplotlib: 3.7.2
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.1.2
pip: 23.2.1
conda: 23.7.3
pytest: 7.4.2
mypy: None
IPython: 8.15.0
sphinx: None
