
Using the shuffle primitive in Xarray #9546

Open
@dcherian


Is your feature request related to a problem?

dask recently added dask.array.shuffle to help with some classic GroupBy.map problems.

shuffle reorders the array so that all members of a single group are in a single chunk, with the possibility of multiple groups in a single chunk. I see a few ways to use this in Xarray:

  1. GroupBy.shuffle(): this shuffles and returns a new GroupBy object with which to do further operations (e.g. map).
  2. Dataset.shuffle_by(Grouper): this shuffles and returns a new Dataset (or DataArray), so that the shuffled data can be persisted to disk or used for other things later (xref "Saving the groups generated from groupby operation" #5674).
  3. Use GroupBy.shuffle under the hood in DatasetGroupBy.quantile and DatasetGroupBy.median, so that the exact quantile always works regardless of chunking (right now we raise an error). This seems like a no-brainer.
  4. Add either a shuffle kwarg to GroupBy.map and/or GroupBy.reduce, or a new API (e.g. GroupBy.transform or GroupBy.map_shuffled) that shuffles and then uses xarray.map_blocks to run a wrapper function that applies the GroupBy on each block. This is how dask dataframe implements Groupby.apply.
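
To make the shuffle semantics concrete, here is a minimal pure-Python sketch of the idea behind options (3) and (4): reorder the data so no group is split across chunks, then apply an exact per-group reduction independently on each chunk. It is an illustration only, not how dask.array.shuffle or Xarray is implemented. The names shuffle_by_group, groupwise_median_per_chunk, and max_chunk_size are hypothetical and exist only for this example.

```python
from collections import defaultdict
from statistics import median


def shuffle_by_group(values, labels, max_chunk_size):
    """Reorder values so each group is contiguous and never split across chunks.

    Hypothetical helper for illustration. Returns a list of chunks; each chunk
    is a list of (label, value) pairs. Small groups may share a chunk, and a
    group larger than max_chunk_size simply gets a chunk of its own.
    """
    # Collect the positions of every group's members, in first-seen order.
    positions = defaultdict(list)
    for i, label in enumerate(labels):
        positions[label].append(i)

    chunks, current = [], []
    for label, idx in positions.items():
        # Close the current chunk if adding this whole group would overflow it.
        if current and len(current) + len(idx) > max_chunk_size:
            chunks.append(current)
            current = []
        current.extend((label, values[i]) for i in idx)
    if current:
        chunks.append(current)
    return chunks


def groupwise_median_per_chunk(chunk):
    """Exact per-group median inside one chunk; needs no data from other chunks."""
    groups = defaultdict(list)
    for label, value in chunk:
        groups[label].append(value)
    return {label: median(vals) for label, vals in groups.items()}


values = [5, 1, 9, 2, 8, 3, 7, 4]
labels = ["a", "b", "a", "c", "b", "a", "c", "b"]

chunks = shuffle_by_group(values, labels, max_chunk_size=5)

# Each chunk can be processed independently, as a blockwise map would do.
result = {}
for chunk in chunks:
    result.update(groupwise_median_per_chunk(chunk))

print(chunks)
print(result)
```

Because every group lives entirely inside one chunk after the shuffle, an exact median (or any other function that needs all members of a group at once) can run blockwise without cross-chunk communication.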

#9320 implements (1) and (2). (1) is mostly for convenience; I could easily see us recommending using (2) before calling the GroupBy.

Thoughts?

