Description
Hello! Thank you for this fantastic package. It has been very helpful for getting some quantile mapping up and running quickly. As a thank you, I wanted to offer an optimization I made for gridded bias correction that you might find useful. I specialize in vectorizing/optimizing gridded data, but don't have the capacity right now to open a full PR.
I used `cm.adjust_3d` on a ~111x111 lat/lon grid with 40,000 time steps. The progress bar estimated around 2.5 hours for this, but I didn't run it to completion. With the implementation below, it ran in 1 minute.
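For reference, the baseline call was roughly the following (a sketch only; the exact `adjust_3d` signature may differ between versions):

```python
result = cm.adjust_3d(
    method="quantile_mapping",
    obs=obs,
    simh=simh,
    simp=simp,
    n_quantiles=250,
    kind="*",
)
```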
Of course, you need a dask cluster running and a dask-backed dataset to reap the full benefit (see the sketch after the function below), but the implementation will speed up in-memory datasets too.
```python
import numpy as np
import xarray as xr

from cmethods import CMethods as cm


def quantile_map_3d(
    obs: xr.DataArray,
    simh: xr.DataArray,
    simp: xr.DataArray,
    n_quantiles: int,
    kind: str,
):
    """Quantile mapping vectorized for 3D operations."""

    def qmap(
        obs: xr.DataArray,
        simh: xr.DataArray,
        simp: xr.DataArray,
        n_quantiles: int,
        kind: str,
    ) -> np.ndarray:
        """Helper for apply_ufunc to vectorize/parallelize the bias correction step."""
        return cm.quantile_mapping(
            obs=obs, simh=simh, simp=simp, n_quantiles=n_quantiles, kind=kind
        )

    result = xr.apply_ufunc(
        qmap,
        obs,
        simh,
        # Need to spoof a fake time axis since the 'time' coord on the full
        # dataset differs from the 'time' coord on the training dataset.
        simp.rename({"time": "t2"}),
        dask="parallelized",
        vectorize=True,
        # Vectorize over everything except the time dimension, so each grid
        # cell is submitted independently.
        input_core_dims=[["time"], ["time"], ["t2"]],
        # Denote that the final output dataset will be labeled with the
        # spoofed time coordinate.
        output_core_dims=[["t2"]],
        kwargs={"n_quantiles": n_quantiles, "kind": kind},
    )

    # Rename back to the proper coordinate name.
    result = result.rename({"t2": "time"})

    # apply_ufunc moves the core dimension (time) to the end, so restore the
    # original order where time is commonly first.
    result = result.transpose(*obs.dims)
    return result
```
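To get the dask speedup, the inputs just need to be chunked before calling the function. A minimal sketch, assuming `lat`/`lon` dimension names and arbitrary chunk sizes; note that `time` must stay in a single chunk because it is a core dimension of the ufunc:

```python
from dask.distributed import Client

client = Client()  # local cluster here; point this at your own cluster as needed

# Chunk across the spatial dimensions only; apply_ufunc requires that the
# core dimension ('time') is not split across chunks.
obs = obs.chunk({"time": -1, "lat": 10, "lon": 10})
simh = simh.chunk({"time": -1, "lat": 10, "lon": 10})
simp = simp.chunk({"time": -1, "lat": 10, "lon": 10})

result = quantile_map_3d(obs, simh, simp, n_quantiles=250, kind="*")
result = result.compute()  # triggers the parallel computation
```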
The nice thing about this is that it handles 1D datasets without any issue. The limitation is that the inputs always have to be xarray objects, but it works with dask or in-memory data and arbitrary dimensions, as long as a labeled `time` dimension exists.
The other great thing is that you could apply the `apply_ufunc` wrapper to every single bias correction method without the need for a separate `adjust_3d` function. A user could then pass in 1D or 2D+ data without any change in code; a rough sketch of that idea follows.
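Something like the following could work. This is only a sketch: the `make_3d` name and the derived `*_3d` names are hypothetical, and it assumes every wrapped method shares the `obs`/`simh`/`simp` keyword calling convention:

```python
def make_3d(method):
    """Lift a 1D bias-correction method to N-D via apply_ufunc."""

    def wrapper(obs, simh, simp, **kwargs):
        def _1d(o, s, p, **kw):
            return method(obs=o, simh=s, simp=p, **kw)

        result = xr.apply_ufunc(
            _1d,
            obs,
            simh,
            simp.rename({"time": "t2"}),
            dask="parallelized",
            vectorize=True,
            input_core_dims=[["time"], ["time"], ["t2"]],
            output_core_dims=[["t2"]],
            kwargs=kwargs,
        )
        return result.rename({"t2": "time"}).transpose(*obs.dims)

    return wrapper


# Hypothetical usage: every method gets the same N-D treatment.
quantile_mapping_3d = make_3d(cm.quantile_mapping)
linear_scaling_3d = make_3d(cm.linear_scaling)
```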
Example:
```python
import pandas as pd

obs = xr.DataArray(
    [[1, 2, 3, 4], [2, 3, 4, 5]],
    dims=["x", "time"],
    coords={"x": [0, 1], "time": pd.date_range("2023-10-25", freq="D", periods=4)},
).transpose("time", "x")

simh = xr.DataArray(
    [[2, 1, 5, 4], [3, 9, 1, 4]],
    dims=["x", "time"],
    coords={"x": [0, 1], "time": pd.date_range("2023-10-25", freq="D", periods=4)},
).transpose("time", "x")

simp = xr.DataArray(
    [[7, 9, 10, 14], [12, 13, 14, 15]],
    dims=["x", "time"],
    coords={"x": [0, 1], "time": pd.date_range("2040-10-25", freq="D", periods=4)},
).transpose("time", "x")

# 2D dataset
>>> quantile_map_3d(obs, simh, simp, 250, "*")
<xarray.DataArray (time: 4, x: 2)>
array([[5., 9.],
       [5., 9.],
       [5., 9.],
       [5., 9.]])
Coordinates:
  * x        (x) int64 0 1
  * time     (time) datetime64[ns] 2040-10-25 2040-10-26 2040-10-27 2040-10-28

# 1D dataset
>>> quantile_map_3d(obs.isel(x=0), simh.isel(x=0), simp.isel(x=0), 250, "*")
<xarray.DataArray (time: 4)>
array([5., 5., 5., 5.])
Coordinates:
    x        int64 0
  * time     (time) datetime64[ns] 2040-10-25 2040-10-26 2040-10-27 2040-10-28
```