These are both entirely lazy operations (no computation is triggered), yet the second one spends nearly a minute just constructing the dask graph:
>>> %time res = ds.groupby('time.month').mean('time')
CPU times: user 142 ms, sys: 20.3 ms, total: 162 ms
Wall time: 159 ms
>>> %time res = ds.groupby('time.month').apply(lambda x: x - x.mean())
CPU times: user 46.1 s, sys: 4.9 s, total: 51 s
Wall time: 50.4 s
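For reference, here's a minimal sketch of the kind of setup that reproduces this. The actual dataset isn't shown above, so the shapes, chunk sizes, and variable names below are hypothetical:

import numpy as np
import pandas as pd
import xray  # since renamed to xarray; the calls below are the same

# A dask-backed dataset with a long hourly time axis
time = pd.date_range('2000-01-01', periods=100000, freq='H')
ds = xray.Dataset({'tmp': (('time', 'x'), np.random.rand(len(time), 100)),
                   'time': time})
ds = ds.chunk({'time': 1000})  # back the variables with dask arrays

# Fast to build: one reduced result per group, stacked along 'month'
res = ds.groupby('time.month').mean('time')

# Slow to build: each group's result has to be interleaved back into
# its original positions along 'time'
res = ds.groupby('time.month').apply(lambda x: x - x.mean())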
I suspect the issue is (in part) that _interleaved_concat_slow indexes out single elements from each dask array along the grouped axis before concatenating them back together (unit tests for interleaved_concat can be found here). So we end up creating far too many tiny dask arrays, each of which adds its own graph-construction overhead.
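To make that suspicion concrete, here's a toy illustration of the pattern with plain dask.array (this is not xray's actual code path, just the shape of the problem):

import dask.array as da
import numpy as np

x = da.from_array(np.arange(12.0), chunks=4)

# Restoring original time order after a 3-group groupby: the group
# results are contiguous, so we must take them in interleaved order
order = np.array([0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11])

# Suspected slow pattern: one 1-element slice per output position,
# then a concatenate over all of them -- the graph grows with len(order)
slow = da.concatenate([x[i:i + 1] for i in order])

# A possibly faster route: a single fancy-index/take along the axis,
# producing one indexing layer instead of thousands of tiny arrays
fast = x[order]  # equivalently: da.take(x, order, axis=0)

Both expressions compute the same result; the difference is entirely in the size of the graph that gets built up front.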
Profiling results on slightly smaller data are in this gist.
It would be great if we could figure out a way to make this faster, because these sorts of operations are a really nice showcase for xray + dask.
CC @mrocklin in case you have any ideas.