Description
Is your feature request related to a problem?
Currently, Xarray's GroupBy operations are limited to single variables. Grouping by multiple coordinates (e.g., time.year and time.season) requires creating a new set of coordinates before grouping due to the xarray limitations described below (source).
xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.
Related code in xcdat for temporal grouping:
Lines 1266 to 1322 in c9bcbcd
def _label_time_coords(self, time_coords: xr.DataArray) -> xr.DataArray:
    """Labels time coordinates with a group for grouping.

    This methods labels time coordinates for grouping by first extracting
    specific xarray datetime components from time coordinates and storing
    them in a pandas DataFrame. After processing (if necessary) is performed
    on the DataFrame, it is converted to a numpy array of datetime
    objects. This numpy serves as the data source for the final
    DataArray of labeled time coordinates.

    Parameters
    ----------
    time_coords : xr.DataArray
        The time coordinates.

    Returns
    -------
    xr.DataArray
        The DataArray of labeled time coordinates for grouping.

    Examples
    --------

    Original daily time coordinates:

    >>> <xarray.DataArray 'time' (time: 4)>
    >>> array(['2000-01-01T12:00:00.000000000',
    >>>        '2000-01-31T21:00:00.000000000',
    >>>        '2000-03-01T21:00:00.000000000',
    >>>        '2000-04-01T03:00:00.000000000'],
    >>>       dtype='datetime64[ns]')
    >>> Coordinates:
    >>> * time     (time) datetime64[ns] 2000-01-01T12:00:00 ... 2000-04-01T03:00:00

    Daily time coordinates labeled by year and month:

    >>> <xarray.DataArray 'time' (time: 3)>
    >>> array(['2000-01-01T00:00:00.000000000',
    >>>        '2000-03-01T00:00:00.000000000',
    >>>        '2000-04-01T00:00:00.000000000'],
    >>>       dtype='datetime64[ns]')
    >>> Coordinates:
    >>> * time     (time) datetime64[ns] 2000-01-01T00:00:00 ... 2000-04-01T00:00:00
    """
    df_dt_components: pd.DataFrame = self._get_df_dt_components(time_coords)
    dt_objects = self._convert_df_to_dt(df_dt_components)

    time_grouped = xr.DataArray(
        name="_".join(df_dt_components.columns),
        data=dt_objects,
        coords={self.dim: time_coords[self.dim]},
        dims=[self.dim],
        attrs=time_coords[self.dim].attrs,
    )
    time_grouped.encoding = time_coords[self.dim].encoding

    return time_grouped
Current temporal averaging logic (workaround for multi-variable grouping):
- Preprocess time coordinates (e.g., drop leap days, subset based on reference climatology)
- Transform time coordinates from an xarray.DataArray to a pandas.DataFrame,
  a. Keep only the DataFrame columns needed for grouping (e.g., "year" and "season" for seasonal group averages), essentially "labeling" coordinates with their groups
  b. Process the DataFrame including:
    - Mapping of months to custom seasons for custom seasonal grouping (now done with Xarray/NumPy via "Add support for custom seasons spanning calendar years" #423)
    - Correction of "DJF" seasons by shifting Decembers over to the next year (now done with Xarray/NumPy via "Add support for custom seasons spanning calendar years" #423)
    - Mapping of seasons to their mid months to create cftime coordinates (season strings aren't supported in cftime/datetime objects)
- Convert DataFrame to cftime objects to represent new time coordinates
- Replace existing time coordinates in the DataArray with new time coordinates
- Group DataArray with new time coordinates for the mean
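The labeling steps above can be sketched in plain Python. This is a simplified illustration, not xcdat's implementation: it uses standard-library datetime objects instead of cftime and a pandas DataFrame, and the names MONTH_TO_SEASON, SEASON_TO_MID_MONTH, label_time_coord, and seasonal_group_means are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical lookup tables for this sketch (not xcdat's actual constants).
MONTH_TO_SEASON = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}
SEASON_TO_MID_MONTH = {"DJF": 1, "MAM": 4, "JJA": 7, "SON": 10}


def label_time_coord(t: datetime) -> datetime:
    """Label a timestamp with its (year, season) group as a single datetime."""
    season = MONTH_TO_SEASON[t.month]
    # "DJF" correction: shift December into the following year so that it
    # is grouped with the January and February it is contiguous with.
    year = t.year + 1 if t.month == 12 else t.year
    # Season strings cannot be stored in a datetime, so use the mid month.
    return datetime(year, SEASON_TO_MID_MONTH[season], 1)


def seasonal_group_means(times, values):
    """Group values by their labeled time coordinate and average each group."""
    groups = defaultdict(list)
    for t, v in zip(times, values):
        groups[label_time_coord(t)].append(v)
    return {label: mean(vals) for label, vals in sorted(groups.items())}


# One value per month of 2000, equal to the month number.
times = [datetime(2000, m, 15) for m in range(1, 13)]
values = [float(m) for m in range(1, 13)]
result = seasonal_group_means(times, values)
```

Here December 2000 ends up under the label 2001-01-01, which is the "shift Decembers over to the next year" step. The sketch also shows why the workaround is indirect: two grouping variables have to be encoded into a single datetime label before grouping.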
Describe the solution you'd like
It would be simpler and possibly more performant to leverage Xarray's newly added support for grouping by multiple variables (e.g., .groupby(["time.year", "time.season"])) instead of using Pandas to store and manipulate datetime components. This solution would remove much of the internal complexity in the temporal averaging API.
Describe alternatives you've considered
Multi-variable grouping was originally done using pd.MultiIndex, but we shifted away from this approach because this object cannot be written out to netcdf4. Also, pd.MultiIndex is not the standard object type for representing time coordinates in xarray. The standard object types are np.datetime64 and cftime.
Additional context
Future solution through xarray + flox:
- Once "Multiple groupers v3" (xarray-contrib/flox#76) is released in a new xarray version in "Update GroupBy constructor for grouping by multiple variables, dask arrays" (pydata/xarray#6610), we should be able to do this.
- Also, "Enable flox in GroupBy and resample" (pydata/xarray#5734) is now merged, which improves .groupby() performance significantly.
- Support group_over pydata/xarray#324 (comment)
- calculating climatologies efficiently pangeo-data/pangeo#271
- https://stackoverflow.com/questions/37008103/python-xarray-grouping-by-multiple-parameters
- https://stackoverflow.com/questions/54776283/how-to-call-the-xarrays-groupby-function-to-group-data-by-a-combination-of-year
- https://stackoverflow.com/questions/69784076/xarray-groupby-according-to-multi-indexs