What is your issue?
`flox` supports grouping by multiple variables (would fix #324, #1056) and grouping by dask variables (would fix #2852).
To enable this in GroupBy we need to update the constructor's signature to
- Accept multiple "by" variables.
- Accept "expected group labels" for grouping by dask variables (like `bins` for `groupby_bins`, which already supports grouping by dask variables). This lets us construct the output coordinate without evaluating the dask variable.
- We may also want to simultaneously group by a categorical variable (season) and bin by a continuous variable (air temperature). So we also need a way to indicate whether the "expected group labels" are "bin edges" or categories.
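To illustrate why expected labels are enough to build the output coordinate, here is a minimal sketch that uses only pandas and NumPy. The names `season_coord`, `temperature_coord`, and `output_shape` are made up for this illustration and are not part of xarray or flox; the point is that nothing here needs the (possibly lazy) data values.

```python
import numpy as np
import pandas as pd

# Expected labels for a categorical variable: given explicitly, in the desired order.
season_coord = pd.Index(["DJF", "MAM", "JJA", "SON"], name="season")

# Expected labels for a binned variable: derived from the bin edges alone.
edges = np.arange(21, 30, 1)  # 9 edges -> 8 bins
temperature_coord = pd.IntervalIndex.from_breaks(edges)  # right-closed by default

# The shape of the reduced result is known before touching any data,
# which is what makes this workable for dask-backed grouping variables.
output_shape = (len(season_coord), len(temperature_coord))
print(output_shape)
```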
The signature in flox is (there may be errors!)
xarray_reduce(
    obj: Dataset | DataArray,
    *by: DataArray | str,
    func: str | Aggregation,
    expected_groups: Sequence | np.ndarray | None = None,
    isbin: bool | Sequence[bool] = False,
    ...
)
You would calculate that last example using flox as
xarray_reduce(
    ds,
    "season", "air_temperature",
    expected_groups=[None, np.arange(21, 30, 1)],
    isbin=[False, True],
    ...
)
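For comparison, the same kind of reduction (group by a categorical variable while binning a continuous one) can be sketched on in-memory data with plain pandas. This is not how flox implements it; the toy values and column names below are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Toy in-memory data standing in for the dataset.
df = pd.DataFrame(
    {
        "season": ["DJF", "DJF", "JJA", "JJA", "JJA"],
        "air_temperature": [21.5, 22.5, 21.5, 28.5, 28.7],
        "value": [1.0, 2.0, 3.0, 4.0, 6.0],
    }
)

# Bin the continuous variable with the same edges as in the flox call.
edges = np.arange(21, 30, 1)
temperature_bins = pd.cut(df["air_temperature"], bins=edges)

# Group by the categorical column and the binned column at once.
# observed=True keeps only the (season, bin) pairs that actually occur.
result = df.groupby(["season", temperature_bins], observed=True)["value"].mean()
print(result)
```

Unlike the flox call, this only reports the combinations that occur in the data, so the output coordinate depends on the values rather than on the expected labels.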
The use of `expected_groups` and `isbin` seems ugly to me (the names could also be better!)
I propose we update groupby's signature to
- Change `group: DataArray | str` to `group: DataArray | str | Iterable[str] | Iterable[DataArray]`.
- We could add a top-level `xr.Bins` object that wraps bin edges + any kwargs to be passed to `pandas.cut`. Note our current `groupby_bins` signature has a bunch of kwargs passed directly to `pandas.cut`.
- Finally add `groups: None | ArrayLike | xarray.Bins | Iterable[None | ArrayLike | xarray.Bins]` to pass the "expected group labels".
  - If `None`, then groups will be auto-detected from non-dask `group` arrays (if `None` for a dask `group`, then raise an error).
  - If `xarray.Bins`, bin by the corresponding variable.
  - If `ArrayLike`, treat as categorical.
  - `groups` is a little too similar to `group`, so we should choose a better name.
  - The ordering of `ArrayLike` would let us fix Ordered Groupby Keys #757 (pass the seasons in the order you want them in the output).
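The dispatch rules above could be sketched roughly as follows. Everything here is hypothetical: `Bins` is a stand-in for the proposed `xr.Bins`, `classify_groups` is an invented helper, and the `group_is_dask` flags stand in for checking whether each grouping array is dask-backed.

```python
class Bins:
    """Hypothetical stand-in for the proposed xr.Bins:
    bin edges plus keyword arguments destined for pandas.cut."""

    def __init__(self, edges, **cut_kwargs):
        self.edges = list(edges)
        self.cut_kwargs = cut_kwargs


def classify_groups(groups, group_is_dask):
    """Decide how each grouping variable is handled.

    Returns one of "auto", "bins", or "categorical" per variable.
    """
    kinds = []
    for expected, is_dask in zip(groups, group_is_dask):
        if expected is None:
            if is_dask:
                # Labels cannot be auto-detected without computing the array.
                raise ValueError(
                    "expected group labels are required for dask variables"
                )
            kinds.append("auto")
        elif isinstance(expected, Bins):
            kinds.append("bins")
        else:
            # Any other array-like is read as an ordered list of categories.
            kinds.append("categorical")
    return kinds


# season is numpy (auto-detect), air_temperature is dask (bin edges required)
kinds = classify_groups([None, Bins(range(21, 30), right=True)], [False, True])
print(kinds)
```

Wrapping the edges in a dedicated type means a single argument can carry both the labels and whether they are bin edges, which is what removes the need for a separate `isbin` flag.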
So then that example becomes
ds.groupby(
    ["season", "air_temperature"],  # season is numpy, air_temperature is dask
    groups=[None, xr.Bins(np.arange(21, 30, 1), closed="right")],
)
Thoughts?