Update GroupBy constructor for grouping by multiple variables, dask arrays #6610

Closed
@dcherian

What is your issue?

flox supports grouping by multiple variables (would fix #324, #1056) and grouping by dask variables (would fix #2852).

To enable this in GroupBy we need to update the constructor's signature to

  1. Accept multiple "by" variables.
  2. Accept "expected group labels" for grouping by dask variables (like bins for groupby_bins which already supports grouping by dask variables). This lets us construct the output coordinate without evaluating the dask variable.
  3. We may also want to simultaneously group by a categorical variable (season) and bin by a continuous variable (air temperature). So we also need a way to indicate whether the "expected group labels" are "bin edges" or categories.

The signature in flox is roughly as follows (there may be errors!)

xarray_reduce(
    obj: Dataset | DataArray,
    *by: DataArray | str,
    func: str | Aggregation,
    expected_groups: Sequence | np.ndarray | None = None,
    isbin: bool | Sequence[bool] = False,
    ...
)

You would calculate that last example using flox as

xarray_reduce(
    ds,
    "season", "air_temperature",
    expected_groups=[None, np.arange(21, 30, 1)],
    isbin=[False, True],
    ...
)
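
For intuition, the eager (non-dask) result of grouping by a categorical variable while binning a continuous one can be sketched with plain pandas. This is not how flox computes it; the toy data, the `precip` column, and the choice of a mean reduction are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: values and the "precip" column are made up for illustration.
df = pd.DataFrame(
    {
        "season": ["DJF", "DJF", "JJA", "JJA", "JJA"],
        "air_temperature": [21.5, 22.5, 21.2, 28.9, 28.1],
        "precip": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# Bin edges 21, 22, ..., 29, the same edges as in the call above.
edges = np.arange(21, 30, 1)

# Bin the continuous variable; right=True gives intervals closed on the right.
temp_bins = pd.cut(df["air_temperature"], bins=edges, right=True)

# Group by the categorical variable and the binned variable together.
result = df.groupby(["season", temp_bins], observed=False)["precip"].mean()

# Keep only the (season, bin) combinations that actually contain data.
observed = result.dropna()
print(observed)
```

Because the bin edges are known up front, the output bins do not depend on the data values, which is the property that makes the same idea workable for dask variables.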

The use of expected_groups and isbin seems ugly to me (the names could also be better!)


I propose we update groupby's signature to

  1. Change group: DataArray | str to group: DataArray | str | Iterable[str] | Iterable[DataArray]
  2. We could add a top-level xr.Bins object that wraps bin edges + any kwargs to be passed to pandas.cut. Note our current groupby_bins signature has a bunch of kwargs passed directly to pandas.cut.
  3. Finally add groups: None | ArrayLike | xarray.Bins | Iterable[None | ArrayLike | xarray.Bins] to pass the "expected group labels".
    1. If None, then groups will be auto-detected from non-dask group arrays (if None for a dask group, then raise error).
    2. If xarray.Bins, bin the corresponding variable using those bin edges.
    3. If ArrayLike, treat the values as categorical labels.
    4. groups is a little too similar to group so we should choose a better name.
    5. The ordering of ArrayLike would let us fix Ordered Groupby Keys #757 (pass the seasons in the order you want them in the output)
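
The dispatch rules above can be sketched in plain Python. This is only a minimal sketch of the proposal, not xarray code: the `Bins` dataclass, the `resolve_groups` helper, and the `is_lazy` argument (standing in for "this variable is a dask array") are hypothetical names invented here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass(frozen=True)
class Bins:
    """Hypothetical stand-in for the proposed xr.Bins: edges plus pandas.cut kwargs."""

    edges: Sequence[float]
    cut_kwargs: dict[str, Any] = field(default_factory=dict)


def resolve_groups(by, groups=None, is_lazy=()):
    """Pair each `by` name with how it would be grouped.

    Returns a list of (name, kind, labels) tuples, where kind is one of
    "auto", "categorical", or "bins".
    """
    if isinstance(by, str):
        # A single variable: wrap both arguments so the loop below is uniform.
        by, groups = [by], [groups]
    elif groups is None:
        groups = [None] * len(by)
    if len(groups) != len(by):
        raise ValueError("need exactly one `groups` entry per `by` variable")

    resolved = []
    for name, expected in zip(by, groups):
        if expected is None:
            if name in is_lazy:
                # Labels cannot be auto-detected without computing the array.
                raise ValueError(f"expected group labels are required for {name!r}")
            resolved.append((name, "auto", None))
        elif isinstance(expected, Bins):
            resolved.append((name, "bins", expected))
        else:
            # Any other array-like is categorical; its order is preserved.
            resolved.append((name, "categorical", list(expected)))
    return resolved


plan = resolve_groups(
    ["season", "air_temperature"],
    groups=[None, Bins(range(21, 30), {"right": True})],
    is_lazy={"air_temperature"},
)
```

Using a dedicated `Bins` type means the kind of each entry is carried by its type, so no parallel `isbin` flag list is needed.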

So then that example becomes

ds.groupby(
    ["season", "air_temperature"], # season is numpy, air_temperature is dask
    groups=[None, xr.Bins(np.arange(21, 30, 1), closed="right")],
)

Thoughts?
