Skip to content

Conversation

@dcherian
Copy link
Contributor

@dcherian dcherian commented Aug 16, 2024

Builds on #9389


This PR takes an approach similar to that in #924.

The GroupBy(multiple array) problem can be reduced to the GroupBy(single nD variable) problem:

  1. First convert each by array to integer codes ("factorize"). I do this first because factorization is expensive, and so it's cheaper to do it before any possible broadcasting.
  2. Broadcast DataArrays of integer codes against each other.
  3. Then np.ravel_multi_index them to construct a single nD array of integer codes.
  4. At this point, we treat it as a GroupBy(single nD array) problem i.e. reshape down to 1D problem. This isn't great for dask/cubed but it is simple and matches what we do today. That means we can apply all the operations we do with GroupBy today.
    • Importantly UDFs receive "stacked arrays" as they do today when grouping by a nD array. This isn't the most elegant but it preserves dtype. For example, you can't index out a 2D array with group==1 without inserting a missing value if group is:
    [[1, 1, 1],
     [2, 1, 1],
     [2, 2, 1]]
    
  5. Once operations are complete, we assign a new MultiIndex and then unstack back to restore the needed dimensions.
    • I'm not sure this is the best at the moment - what if one of the arrays we group by is a multiIndexed array?

TODO:

  • lots more tests!

@dcherian
Copy link
Contributor Author

dcherian commented Aug 16, 2024

@max-sixty I could use some help with adding tests,docs,docstrings if you have time :)

@dcherian dcherian force-pushed the multiple-groupers-2 branch 2 times, most recently from 8701e9b to aa52ce6 Compare August 17, 2024 03:03
dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 17, 2024
dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 17, 2024
return newgroup


@dataclass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main new logic.

@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label Aug 17, 2024
max-sixty added a commit to max-sixty/xarray that referenced this pull request Aug 17, 2024
While trying to help with pydata#9372, I realize the error message for this could be much better, and so putting this PR in as some penance for my tardiness in helping there
@max-sixty
Copy link
Collaborator

I had a look at where I add something — it already seems extremely good? gb - gb.mean() would be cool but this already seems like a big step forwards.

I missed previous discussions / can decide this later — but do we want to have da.groupby(['foo', 'bar']) as sugar for da.groupby(foo=UniqueGrouper(), bar=UniqueGrouper())?

dcherian pushed a commit that referenced this pull request Aug 17, 2024
* Improve error message on `ds['x', 'y']`

While trying to help with #9372, I realize the error message for this could be much better, and so putting this PR in as some penance for my tardiness in helping there

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@dcherian
Copy link
Contributor Author

Yes it is surprisingly complete.

A section in the narrative docs would be nice to add.

@dcherian
Copy link
Contributor Author

missed previous discussions / can decide this later — but do we want to have da.groupby(['foo', 'bar']) as sugar for da.groupby(foo=UniqueGrouper(), bar=UniqueGrouper())?

I agree. It's a bit too much typing at the moment

@dcherian dcherian force-pushed the multiple-groupers-2 branch 3 times, most recently from 414f95a to b28a3fa Compare August 20, 2024 16:12
@dcherian dcherian mentioned this pull request Aug 21, 2024
1 task
@dcherian dcherian force-pushed the multiple-groupers-2 branch from 2514e90 to da23871 Compare August 21, 2024 04:59
dcherian and others added 2 commits August 21, 2024 09:07
fix docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
@dcherian dcherian force-pushed the multiple-groupers-2 branch from da23871 to 7ec2341 Compare August 21, 2024 15:08
@dcherian dcherian marked this pull request as ready for review August 21, 2024 16:27
@dcherian dcherian requested a review from max-sixty August 21, 2024 22:56
@dcherian
Copy link
Contributor Author

dcherian commented Aug 21, 2024

We should test grouping by a multi-index array and some other array but otherwise this should be good to go.

EDIT: Now raises NotImplementedError for this case.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge! Thank you very much @dcherian !

@dcherian
Copy link
Contributor Author

Merging so it's ready to try out!

@dcherian dcherian merged commit 19a0428 into pydata:main Aug 26, 2024
@dcherian dcherian deleted the multiple-groupers-2 branch August 26, 2024 15:56
@max-sixty
Copy link
Collaborator

Huge success! Thank you @dcherian !

@shoyer
Copy link
Member

shoyer commented Aug 28, 2024

Wow, amazing work @dcherian! I am so excited to see this finally land.

hollymandel pushed a commit to hollymandel/xarray that referenced this pull request Aug 30, 2024
* GroupBy(multiple groupers)

* Add example to docs

fix docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs

* More docs

* fix doc

* fix doc again

* Fix bug.

* Add whats-new note

* edit

* Error on multi-variable groupby with MultiIndex

* Update doc/user-guide/groupby.rst

---------

Co-authored-by: Maximilian Roos <m@maxroos.com>
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 30, 2024
* main:
  Accessibility: Add keyboard handling for XArray HTML view (pydata#9412)
  [pre-commit.ci] pre-commit autoupdate (pydata#9316)
  [skip-ci] Speed up docs build by limiting toctrees (pydata#9395)
  fix the failing `pre-commit.ci` runs (pydata#9411)
  Update benchmarks.yml (pydata#9406)
  GroupBy(multiple groupers) (pydata#9372)
  Encode/decode property tests use variables() (pydata#9401)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable multi-coord grouping from xarray groupby_bins along two dims simultaneously

3 participants