These are both entirely lazy operations (no computation is triggered), yet the second one spends nearly a minute just constructing the dask graph:
>>> %time res = ds.groupby('time.month').mean('time')
CPU times: user 142 ms, sys: 20.3 ms, total: 162 ms
Wall time: 159 ms
>>> %time res = ds.groupby('time.month').apply(lambda x: x - x.mean())
CPU times: user 46.1 s, sys: 4.9 s, total: 51 s
Wall time: 50.4 s
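For reference, here's a minimal sketch of the kind of setup that reproduces this. The actual dataset isn't shown above, so the shapes, chunk sizes, and variable names below are hypothetical:

import numpy as np
import pandas as pd
import xray  # since renamed to xarray; the calls below are the same

# A dask-backed dataset with a long hourly time axis
time = pd.date_range('2000-01-01', periods=100000, freq='H')
ds = xray.Dataset({'tmp': (('time', 'x'), np.random.rand(len(time), 100)),
                   'time': time})
ds = ds.chunk({'time': 1000})  # back the variables with dask arrays

# Fast to build: one reduced result per group, stacked along 'month'
res = ds.groupby('time.month').mean('time')

# Slow to build: each group's result has to be interleaved back into
# its original positions along 'time'
res = ds.groupby('time.month').apply(lambda x: x - x.mean())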
I suspect the issue is (in part) that _interleaved_concat_slow indexes out single elements from each dask array along the grouped axis before concatenating them back together (unit tests for interleaved_concat can be found here). So we end up creating far too many tiny dask arrays, each of which adds its own graph-construction overhead.
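To make that suspicion concrete, here's a toy illustration of the pattern with plain dask.array (this is not xray's actual code path, just the shape of the problem):

import dask.array as da
import numpy as np

x = da.from_array(np.arange(12.0), chunks=4)

# Restoring original time order after a 3-group groupby: the group
# results are contiguous, so we must take them in interleaved order
order = np.array([0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11])

# Suspected slow pattern: one 1-element slice per output position,
# then a concatenate over all of them -- the graph grows with len(order)
slow = da.concatenate([x[i:i + 1] for i in order])

# A possibly faster route: a single fancy-index/take along the axis,
# producing one indexing layer instead of thousands of tiny arrays
fast = x[order]  # equivalently: da.take(x, order, axis=0)

Both expressions compute the same result; the difference is entirely in the size of the graph that gets built up front.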
Profiling results on slightly smaller data are in this gist.
It would be great if we could figure out a way to make this faster, because these sorts of operations are a really nice showcase for xray + dask.
CC @mrocklin in case you have any ideas.