Skip to content

DataArray.quantile is very slow for Dask applications (and generally) #9692

Closed
@phofl

Description

@phofl

What happened?

I've recently run into this with a real world workload where calling quantile on a DataArray was approx. 30x-40x slower than just calling median (6 minutes to 2.5 hours). We profiled this a little and it basically goes down to the NumPy implementation of quantiles and the shapes of our chunks. We had a relatively small time axis (i.e. 50-120 elements).

NumPy has a 1d quantile function only, so it calls apply_along_axis and basically iterates over the other dimensions which clocks up the GIL quite badly (it's also not very fast generally speaking, but running single threaded doesn't make it that bad). Median has a special implementation here that avoids this problem to some degree.

Here is a snapshot of a dask Dashboard running quantile

Screenshot 2024-10-28 at 22 33 52

The purple one is np.nanquantile, the green one is the custom implementation. These are workers with 2 threads, it gets exponentially worse with more threads because they are all locking each other up. 4 threads makes the runtime of a single task balloon to 220s

cc @dcherian

What did you expect to happen?

Things should be faster.

We can do 2 things on the Dask side to make things work better:

  • Add a dask.array.nanquantile function that xarray can call directly. This would basically be a map_blocks with a custom quantile function
  • Add a helper quantile function that Xarray could plug in instead of np.nanquantile if a variable is chunked

I made some timings that are shown in the screen shot above. The custom implementation runs in 1.3 seconds (the green one). The custom implementation will only cover the most common cases and dispatch to the numpy quantile function otherwise.

This will (by extension) also help groupby quantile if I understand things correctly, since it's just calling into the regular quantile per group as far as I understand this.

Minimal Complete Verifiable Example

import dask.array as da
import numpy as np
import xarray as xr


darr = xr.DataArray(
    da.random.random((8944, 7531, 50), chunks=(904, 713, -1)),
    dims=["x", "y", "time"],
)
darr.quantile(q=0.75, dim="time")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 15:57:01) [Clang 17.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 23.3.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: None

xarray: 2024.9.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.14.0
netCDF4: None
pydap: None
h5netcdf: 1.3.0
h5py: 3.12.1
zarr: 2.18.2
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: 1.4.1
dask: 2024.10.0+8.g6a18fe07
distributed: 2024.10.0+11.gafa6e8db
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.6.1
cupy: None
pint: None
sparse: 0.15.4
flox: 0.9.9
numpy_groupies: 0.11.2
setuptools: 75.1.0
pip: 24.2
conda: None
pytest: 8.3.3
mypy: None
IPython: 8.28.0
sphinx: None

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions