Slow initialization of dataset.interp #4739

Closed · Illviljan opened this issue Dec 29, 2020 · 2 comments · Fixed by #4740

Comments

Illviljan (Contributor) commented Dec 29, 2020

What happened:
When interpolating a dataset with >2000 dask variables, a lot of time is spent in dask's unify_chunks, because unify_chunks forces every variable and coordinate to a dask array. xarray, on the other hand, forces coordinates to pd.Index even if the coordinates were dask arrays when the dataset was first created.
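A quick way to see that coercion (this check is not from the original report; the exact types printed may vary slightly across xarray versions):

import dask.array as da
import xarray as xr

ds = xr.Dataset(coords={"time": ("time", da.arange(3))})
print(type(ds["time"].variable))  # IndexVariable, not a lazy dask-backed variable
print(type(ds.indexes["time"]))   # an in-memory pd.Index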

What you expected to happen:
If the coords of the dataset were initialized as dask arrays, they should stay lazy.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np
import dask.array as da

# Build 2000 variables named long_variable_name0 ... long_variable_name1999,
# each a small dask array sharing a dask-backed "time" coordinate.
a = np.arange(0, 2000)
b = np.core.defchararray.add("long_variable_name", a.astype(str))
coords = dict(time=da.array([0, 1]))
data_vars = dict()
for v in b:
    data_vars[v] = xr.DataArray(
        name=v,
        data=da.array([3, 4]),
        dims=["time"],
        coords=coords,
    )
ds0 = xr.Dataset(data_vars)

# Interpolating to a new (dask) time axis is slow to set up:
ds0 = ds0.interp(
    time=da.array([0, 0.5, 1]),
    assume_sorted=True,
    kwargs=dict(fill_value=None),
)
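One way to see where the setup time goes (a profiling suggestion, not part of the original report; run it on ds0 as constructed above, before its interp call, and expect the exact output to vary by machine and library version):

import cProfile

cProfile.run(
    "ds0.interp(time=da.array([0, 0.5, 1]), assume_sorted=True, kwargs=dict(fill_value=None))",
    sort="cumtime",
)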

Anything else we need to know?:
Some thoughts:

  • Why can't coordinates be lazy?
    • Could we use dask.dataframe.Index instead of pd.Index when creating IndexVariables?
  • There's no time saved by converting to dask arrays in missing.interp_func. But some time could be saved if we converted them to dask arrays in xr.Dataset.interp before the variable loop starts (see the sketch after this list).
  • Could we still store the dask array in IndexVariable and use a to_dask_array() method to get it back quickly?
    • Initializing the DataArrays would still be slow, though, since the dask array still has to be forced to a pd.Index.
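A minimal sketch of the "convert once, before the loop" idea; the helper name coords_to_dask_once is hypothetical and not part of xarray:

import numpy as np
import dask.array as da

def coords_to_dask_once(indexes, destinations):
    # Hypothetical helper: convert the source indexes and destination
    # coordinates to dask arrays a single time, so the per-variable loop in
    # Dataset.interp does not repeat the conversion for each of the ~2000
    # variables.
    to_dask = lambda arr: (
        arr if isinstance(arr, da.Array) else da.from_array(np.asarray(arr), chunks=-1)
    )
    return [to_dask(i) for i in indexes], [to_dask(d) for d in destinations]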

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS

commit: None
python: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
libhdf5: 1.10.4
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.5
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 51.0.0.post20201207
pip: 20.3.3
conda: 4.9.2
pytest: 6.2.1
IPython: 7.19.0
sphinx: 3.4.0

dcherian (Contributor) commented Dec 29, 2020

We don't support lazy index variables yet (#1603), so you can't interpolate to a dask variable.

But some time could be saved if we could convert them to dask arrays in xr.Dataset.interp before the variable loop starts.

This may be true. I think we could convert x and destination to dask (only once) if any of the variables to be interpolated are dask arrays, and pass those to interp_func here rather than passing IndexVariables through:

interped = interp_func(
    var.transpose(*original_dims).data, x, destination, method, kwargs
)

OTOH I found some easier optimizations. See #4740

  1. Passing meta to blockwise saves 0.5s in your example (see the first sketch after this list).
  2. Another thing we can do is call _localize at the Dataset level rather than within the variable loop. _localize is taking 1.65s, most of which is spent in 4000 calls to get_loc; at the Dataset level, this becomes just 2 calls to get_loc (see the second sketch after this list).
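A minimal sketch of the meta idea (blockwise and its meta keyword are real dask API, but this is not xarray's actual call site; np.cos is an arbitrary stand-in function):

import numpy as np
import dask.array as da

x = da.ones((6,), chunks=3)

# Without meta, blockwise calls the function on zero-sized sample inputs to
# infer the output dtype/type; passing meta skips that inference step.
y = da.blockwise(
    np.cos,
    "i",
    x,
    "i",
    meta=np.empty((0,), dtype=x.dtype),
)
print(y.compute())

And a hypothetical illustration of the _localize idea: bracket all destination points with two nearest-neighbour lookups on the source index, once per Dataset instead of once per variable (localize here is a stand-in, not xarray's internal function):

import numpy as np
import pandas as pd

def localize(index, new_x):
    # Two index lookups bracket every destination point at once.
    imin = index.get_indexer([new_x.min()], method="nearest")[0]
    imax = index.get_indexer([new_x.max()], method="nearest")[0]
    return slice(max(imin - 1, 0), imax + 2)

idx = pd.Index([0.0, 1.0, 2.0, 3.0])
print(localize(idx, np.array([0.4, 1.6])))  # slice(0, 4)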

dcherian (Contributor) commented

But some time could be saved if we could convert them to dask arrays in xr.Dataset.interp before the variable loop starts.

Now implemented. Runtime has dropped from 5.3s to 2.3s (!)
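A quick way to reproduce the before/after timing (assuming ds0 is the dataset as constructed in the MCVE above, before its interp call; numbers will vary by machine):

import time

t0 = time.perf_counter()
ds0.interp(
    time=da.array([0, 0.5, 1]),
    assume_sorted=True,
    kwargs=dict(fill_value=None),
)
print(f"interp took {time.perf_counter() - t0:.1f}s")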
