Skip to content

Dataset.where performances regression. #7516

Open
@Thomas-Z

Description

@Thomas-Z

What happened?

Hello,

I'm using the Dataset.where function to select data based on some fields values and it takes way to much time!
The dask dashboard seems to show some tasks repeating themselves many times.

The provided example uses a 1D array for which the selection could be done with Dataset.sel but with our real usecase we make selections on 2D variables.

This problem seems to have appeared with the 2022.6.0 xarray release, the 2022.3.0 is working as expected.

What did you expect to happen?

Using the 2022.3 release, this selection takes 1.37 seconds.
Using the 2022.6.0 up to the 2023.2.0 (the one from yesterday), this selection takes 8.47 seconds.

This example is a very simple and small one, with real data and use case we simply cannot use this function anymore.

Minimal Complete Verifiable Example

import dask.array as da
import distributed as dist
import xarray as xr


client = dist.Client()

# Using small chunks emphasis the problem
ds = xr.Dataset(
    {"field": xr.DataArray(data=da.empty(shape=10000, chunks=10), dims=("x"))}
)
sel = ds["field"] > 0

ds.where(sel, drop=True)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

Problematic version

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
mypy: None
IPython: 8.7.0
sphinx: 5.3.0

Working version

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
IPython: 8.7.0
sphinx: 5.3.0

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions