GroupBy(chunked-array) (#9522)
* GroupBy(chunked-array)

Closes #757
Closes #2852

* Optimizations

* Optimize multi-index construction

* Add tests

* Add whats-new

* Raise errors

* Add docstring

* preserve attrs

* Add test for #757

* Typing fixes

* Handle multiple groupers

* Backcompat

* better backcompat

* fix

* Handle edge case

* comment

* type: ignore
dcherian authored Nov 4, 2024
1 parent 29654fc commit a00bc91
Showing 10 changed files with 441 additions and 68 deletions.
9 changes: 9 additions & 0 deletions doc/user-guide/groupby.rst
@@ -294,6 +294,15 @@ is identical to
ds.resample(time=TimeResampler("ME"))
The :py:class:`groupers.UniqueGrouper` accepts an optional ``labels`` kwarg that is not present
in :py:meth:`DataArray.groupby` or :py:meth:`Dataset.groupby`.
Specifying ``labels`` is required when grouping by a lazy array type (e.g. dask or cubed).
The ``labels`` are used to construct the output coordinate (say for a reduction), and aggregations
will only be run over the specified labels.
You may also use ``labels`` to specify the order in which the groups are iterated over;
this order is preserved in the output.


.. _groupby.multiple:

Grouping by multiple variables
17 changes: 8 additions & 9 deletions doc/whats-new.rst
Expand Up @@ -23,14 +23,21 @@ New Features
~~~~~~~~~~~~
- Added :py:meth:`DataTree.persist` method (:issue:`9675`, :pull:`9682`).
By `Sam Levang <https://github.com/slevang>`_.
- Support lazy grouping by dask arrays, and allow specifying ordered groups with ``UniqueGrouper(labels=["a", "b", "c"])``
(:issue:`2852`, :issue:`757`).
By `Deepak Cherian <https://github.com/dcherian>`_.

Breaking changes
~~~~~~~~~~~~~~~~


Deprecations
~~~~~~~~~~~~

- Grouping by a chunked array (e.g. dask or cubed) currently eagerly loads that variable into
memory. This behaviour is deprecated. If eager loading was intended, load such arrays
manually using ``.load()`` or ``.compute()``. Otherwise, pass ``eagerly_compute_group=False`` and
provide the expected group labels using the ``labels`` kwarg to a grouper object such as
:py:class:`groupers.UniqueGrouper` or :py:class:`groupers.BinGrouper`.

Bug fixes
~~~~~~~~~
@@ -94,14 +101,6 @@ New Features
(:issue:`9427`, :pull:`9428`).
By `Alfonso Ladino <https://github.com/aladinor>`_.

Breaking changes
~~~~~~~~~~~~~~~~


Deprecations
~~~~~~~~~~~~


Bug fixes
~~~~~~~~~

2 changes: 1 addition & 1 deletion xarray/core/common.py
@@ -1094,7 +1094,7 @@ def _resample(
f"Received {type(freq)} instead."
)

rgrouper = ResolvedGrouper(grouper, group, self)
rgrouper = ResolvedGrouper(grouper, group, self, eagerly_compute_group=False)

return resample_cls(
self,
20 changes: 18 additions & 2 deletions xarray/core/dataarray.py
@@ -6748,6 +6748,7 @@ def groupby(
*,
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
eagerly_compute_group: bool = True,
**groupers: Grouper,
) -> DataArrayGroupBy:
"""Returns a DataArrayGroupBy object for performing grouped operations.
@@ -6763,6 +6764,11 @@
restore_coord_dims : bool, default: False
If True, also restore the dimension order of multi-dimensional
coordinates.
eagerly_compute_group : bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option exists to maintain backwards compatibility. Set to False
to opt in to the future behaviour, where ``group`` is not automatically
loaded into memory.
**groupers : Mapping of str to Grouper or Resampler
Mapping of variable name to group by to :py:class:`Grouper` or :py:class:`Resampler` object.
One of ``group`` or ``groupers`` must be provided.
@@ -6877,7 +6883,9 @@
)

_validate_groupby_squeeze(squeeze)
rgroupers = _parse_group_and_groupers(self, group, groupers)
rgroupers = _parse_group_and_groupers(
self, group, groupers, eagerly_compute_group=eagerly_compute_group
)
return DataArrayGroupBy(self, rgroupers, restore_coord_dims=restore_coord_dims)

@_deprecate_positional_args("v2024.07.0")
@@ -6892,6 +6900,7 @@ def groupby_bins(
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
duplicates: Literal["raise", "drop"] = "raise",
eagerly_compute_group: bool = True,
) -> DataArrayGroupBy:
"""Returns a DataArrayGroupBy object for performing grouped operations.
@@ -6928,6 +6937,11 @@
coordinates.
duplicates : {"raise", "drop"}, default: "raise"
If bin edges are not unique, raise ValueError or drop non-uniques.
eagerly_compute_group : bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option exists to maintain backwards compatibility. Set to False
to opt in to the future behaviour, where ``group`` is not automatically
loaded into memory.
Returns
-------
@@ -6965,7 +6979,9 @@
precision=precision,
include_lowest=include_lowest,
)
rgrouper = ResolvedGrouper(grouper, group, self)
rgrouper = ResolvedGrouper(
grouper, group, self, eagerly_compute_group=eagerly_compute_group
)

return DataArrayGroupBy(
self,
20 changes: 18 additions & 2 deletions xarray/core/dataset.py
@@ -10379,6 +10379,7 @@ def groupby(
*,
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
eagerly_compute_group: bool = True,
**groupers: Grouper,
) -> DatasetGroupBy:
"""Returns a DatasetGroupBy object for performing grouped operations.
@@ -10394,6 +10395,11 @@
restore_coord_dims : bool, default: False
If True, also restore the dimension order of multi-dimensional
coordinates.
eagerly_compute_group : bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option exists to maintain backwards compatibility. Set to False
to opt in to the future behaviour, where ``group`` is not automatically
loaded into memory.
**groupers : Mapping of str to Grouper or Resampler
Mapping of variable name to group by to :py:class:`Grouper` or :py:class:`Resampler` object.
One of ``group`` or ``groupers`` must be provided.
@@ -10476,7 +10482,9 @@
)

_validate_groupby_squeeze(squeeze)
rgroupers = _parse_group_and_groupers(self, group, groupers)
rgroupers = _parse_group_and_groupers(
self, group, groupers, eagerly_compute_group=eagerly_compute_group
)

return DatasetGroupBy(self, rgroupers, restore_coord_dims=restore_coord_dims)

@@ -10492,6 +10500,7 @@
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
duplicates: Literal["raise", "drop"] = "raise",
eagerly_compute_group: bool = True,
) -> DatasetGroupBy:
"""Returns a DatasetGroupBy object for performing grouped operations.
@@ -10528,6 +10537,11 @@
coordinates.
duplicates : {"raise", "drop"}, default: "raise"
If bin edges are not unique, raise ValueError or drop non-uniques.
eagerly_compute_group : bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option exists to maintain backwards compatibility. Set to False
to opt in to the future behaviour, where ``group`` is not automatically
loaded into memory.
Returns
-------
@@ -10565,7 +10579,9 @@
precision=precision,
include_lowest=include_lowest,
)
rgrouper = ResolvedGrouper(grouper, group, self)
rgrouper = ResolvedGrouper(
grouper, group, self, eagerly_compute_group=eagerly_compute_group
)

return DatasetGroupBy(
self,