groupby: Dispatch quantile to flox. #8720

Merged: 17 commits, Mar 26, 2024


dcherian
Contributor

@dcherian dcherian commented Feb 7, 2024

  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@aulemahal would you be able to test against xclim's test suite? I imagine you're doing a bunch of grouped quantiles.

@aulemahal
Contributor

aulemahal commented Feb 8, 2024

@dcherian It works for the very simple test I did. We have fewer groupby().quantile() calls than you might think, because the main "grouped quantile" function we implement is meant for "day-of-year" grouping, including a window around each doy, and the whole process is done without calling groupby. That code predates many improvements in xarray; I guess we should revisit it once this PR is completed!

One test in xclim's CI where we perform a grouped quantile failed when I modified it to use a chunked array: ValueError: Aggregation nanquantile is only supported for method='blockwise', but the chunking is not right. I understand the issue, and the code did run after a call to flox.xarray.rechunk_for_blockwise. I guess that's seen as the user's responsibility here? Have there been any discussions on automatically running rechunk_for_blockwise?
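The restriction to blockwise reductions reflects a general property of quantiles: partial results computed per chunk cannot simply be combined into the exact answer. A minimal sketch with made-up data, using only the Python standard library (it does not call xarray or flox):

```python
from statistics import median

# One group's values, split across two chunks (hypothetical data).
chunk_a = [1, 2, 3, 4, 100]
chunk_b = [5, 6]

# Exact result: the median over all members of the group at once.
exact = median(chunk_a + chunk_b)

# Naive map-reduce: reduce each chunk first, then combine the partial results.
combined = median([median(chunk_a), median(chunk_b)])

print(exact, combined)  # the two values differ
```

Because the two numbers disagree, an exact grouped quantile needs all members of a group in the same chunk, which is what rechunking for a blockwise reduction arranges.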

@dcherian
Contributor Author

dcherian commented Feb 8, 2024

It should do that if you explicitly pass method="blockwise".

But great catch! I'll have to add a fallback. I haven't thought about how to infer that rechunk_for_blockwise is sensible.
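One way to picture the condition a fallback would need to detect is a check that every group's members sit inside a single chunk. The helper below is a hypothetical illustration written for this note, not flox's logic; `groups_fit_in_single_chunks`, `labels`, and `chunk_sizes` are assumed names.

```python
def groups_fit_in_single_chunks(labels, chunk_sizes):
    """Return True if every group's members all fall inside one chunk."""
    if sum(chunk_sizes) != len(labels):
        raise ValueError("chunk sizes must add up to the number of labels")
    chunk_of_group = {}
    start = 0
    for chunk_index, size in enumerate(chunk_sizes):
        for label in labels[start:start + size]:
            # setdefault records the first chunk in which a label is seen;
            # seeing the same label in a later chunk means the group is split.
            if chunk_of_group.setdefault(label, chunk_index) != chunk_index:
                return False
        start += size
    return True


labels = ["a", "a", "b", "b", "b", "c"]
aligned = groups_fit_in_single_chunks(labels, [2, 4])     # a a | b b b c
misaligned = groups_fit_in_single_chunks(labels, [3, 3])  # group "b" is split
```

When the check fails, the array would have to be rechunked (for example with the rechunk_for_blockwise function mentioned earlier in the thread) before a blockwise quantile can run.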

@dcherian
Contributor Author

dcherian commented Feb 8, 2024

the main "grouped quantile" function we implement is meant for "day-of-year" grouping including a window around each doy and the whole process is done without calling groupby.

When you do this, is the time axis a single chunk?

@dcherian dcherian marked this pull request as draft February 8, 2024 23:08
@aulemahal
Contributor

When you do this, is the time axis a single chunk?

That's the "custom implementation" part: we manipulate the array and rechunk so that all members of one group (a day of year plus a window) fall into the same chunk.

@dcherian
Contributor Author

Can you link to that section of the code please?

@aulemahal
Contributor

The function is here: https://github.com/Ouranosinc/xclim/blob/acbccb8902fb05318c53d76beb1c97067dfb8f11/xclim/core/calendar.py#L627
The grouping is done through a custom multiindex and stacking, and thus it only supports daily data (technically it supports coarser data, but the function really is meant for daily).
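To make the idea of "a day of year plus a window" concrete, here is a rough standard-library sketch. It is not xclim's implementation (which, as described above, uses a custom multiindex and stacking); the function names, the 365-day calendar, and the wrap-around at the year boundary are all assumptions made for illustration.

```python
import math
from datetime import date, timedelta


def linear_quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def windowed_doy_quantile(dates, values, q, window=5, days_in_year=365):
    """Pool values within a centred window around each day of year, then take a quantile."""
    half = window // 2
    pooled = {}
    for day, value in zip(dates, values):
        doy = day.timetuple().tm_yday
        for offset in range(-half, half + 1):
            # Wrap around the year boundary so that day 1 also sees late December.
            # Assumes a 365-day calendar; leap days are not handled here.
            target = (doy - 1 + offset) % days_in_year + 1
            pooled.setdefault(target, []).append(value)
    return {doy: linear_quantile(vals, q) for doy, vals in sorted(pooled.items())}


# One non-leap year of daily data where each value equals (day of year - 1).
start = date(2001, 1, 1)
dates = [start + timedelta(days=i) for i in range(365)]
values = [float(i) for i in range(365)]
result = windowed_doy_quantile(dates, values, q=0.5, window=5)
```

Each value contributes to several overlapping groups, which is why this kind of grouping does not map directly onto a plain groupby, where every element belongs to exactly one group.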

* main: (42 commits)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  Grouper object design doc (pydata#8510)
  Bump the actions group with 2 updates (pydata#8804)
  tokenize() should ignore difference between None and {} attrs (pydata#8797)
  fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791)
  Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782)
  Migrate treenode module. (pydata#8757)
  ...
@dcherian dcherian changed the title groupby: Dispatch median, quantile to flox. groupby: Dispatch quantile to flox. Mar 16, 2024
@dcherian dcherian marked this pull request as ready for review March 16, 2024 02:09
@dcherian dcherian added plan to merge Final call for comments and removed needs review labels Mar 25, 2024
@dcherian
Contributor Author

I'll merge tomorrow if there are no comments.

@dcherian dcherian merged commit ee02113 into pydata:main Mar 26, 2024
29 checks passed
@dcherian dcherian deleted the flox-quantile branch March 26, 2024 15:08
dcherian added a commit to dcherian/xarray that referenced this pull request Apr 2, 2024
* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...
Labels
plan to merge Final call for comments

2 participants