Commit 0c6e212

more edits
1 parent d7eb21d commit 0c6e212

File tree

1 file changed (+10, -5 lines)


src/posts/flox-smart/index.md

Lines changed: 10 additions & 5 deletions
@@ -8,6 +8,10 @@ authors:
summary: 'flox adds heuristics for automatically choosing an appropriate strategy with dask arrays!'
---

+## TL;DR
+
+`flox>=0.9` adds heuristics for automatically choosing an appropriate strategy with dask arrays! Here I describe how.
+
## What is flox?

[`flox` implements](https://flox.readthedocs.io/) grouped reductions for chunked array types like cubed and dask using a tree reduction approach.
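For readers who haven't used it, here is a minimal sketch of such a grouped reduction, assuming `flox.groupby_reduce` as the low-level entry point; the array shapes and labels are made up for illustration.

```python
import dask.array as da
import numpy as np
import flox

# A chunked 2D dask array and an in-memory label array along the last axis.
array = da.random.random((10, 100), chunks=(10, 25))
labels = np.random.randint(0, 5, size=100)  # five groups, randomly distributed

# Tree-reduced grouped mean over the labelled axis; returns the reduced
# array and the unique group labels that were found.
result, groups = flox.groupby_reduce(array, labels, func="mean", axis=-1)
```

With a NumPy `labels` array, flox can find the unique groups up front; the reduction itself stays lazy until you compute it.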
@@ -42,17 +46,19 @@ Thus `flox` quickly grew two new modes of computing the groupby reduction.
First, `method="blockwise"`, which applies the grouped reduction in a blockwise fashion.
This is great for `resample(time="Y").mean()` where we group by `"time.year"`, which is a monotonically increasing array.
With an appropriate (and usually quite cheap) rechunking, the problem is embarrassingly parallel.
+![blockwise](https://flox.readthedocs.io/en/latest/_images/new-blockwise-annotated.svg)

Second, `method="cohorts"`, which is a bit more subtle.
Consider `groupby("time.month")` for the monthly mean dataset, i.e. grouping by an exactly periodic array.
When the chunk size along the core dimension "time" is a divisor of the period (so 1, 2, 3, 4, or 6 in this case), groups tend to occur in cohorts ("groups of groups").
For example, with a chunk size of 4, monthly mean input data for Jan, Feb, Mar, and April ("one cohort") are _always_ in the same chunk, and totally separate from any of the other months.
+![monthly cohorts](https://flox.readthedocs.io/en/latest/_images/cohorts-month-chunk4.png)
This means that we can run the tree reduction for each cohort (three cohorts in total: `JFMA | MJJA | SOND`) independently and expose more parallelism.
Doing so can significantly reduce compute times and, in particular, the memory required for the computation.

Importantly, if there isn't much separation of groups into cohorts (for example, when the groups are randomly distributed), then we'd prefer the standard `method="map-reduce"` for its low overhead.

-## Choosing a strategy is hard, and hard to teach.
+## Choosing a strategy is hard, and harder to teach.

These strategies are great, but the downside is that some sophistication is required to apply them.
Worse, they are hard to explain conceptually! I've tried! ([example 1](https://discourse.pangeo.io/t/optimizing-climatology-calculation-with-xarray-and-dask/2453/20?u=dcherian), [example 2](https://discourse.pangeo.io/t/understanding-optimal-zarr-chunking-scheme-for-a-climatology/2335)).
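To make the strategies above concrete, here is a hedged sketch of requesting one explicitly, reusing the monthly/chunk-size-4 example; the shapes are invented, and the `method` argument to `flox.groupby_reduce` is assumed to be the way these strategies are selected by hand.

```python
import dask.array as da
import numpy as np
import flox

# Two "years" of monthly data, chunked in blocks of 4 along time, so the
# chunks line up into the cohorts JFMA | MJJA | SOND described above.
array = da.random.random((10, 24), chunks=(10, 4))
months = np.tile(np.arange(1, 13), 2)  # 1..12, repeated

# Ask for the cohorts strategy explicitly; "map-reduce" (the default tree
# reduction) and "blockwise" (after rechunking so every group sits in a
# single chunk) are the other options discussed here.
result, groups = flox.groupby_reduce(
    array, months, func="mean", axis=-1, method="cohorts"
)
```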
@@ -104,16 +110,12 @@ The steps are as follows:
1. First determine which labels are present in each chunk. The distribution of labels across chunks
   is represented internally as a 2D boolean sparse array `S[chunks, labels]`. `S[i, j] = 1` when
   label `j` is present in chunk `i`.
-
1. Now we can quickly determine a number of special cases:
-
   1. Use `"blockwise"` when every group is contained within exactly one block.
   1. Use `"cohorts"` when every chunk contains only a single group, though that group might extend across multiple chunks.
   1. [and more](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L408-L426)
-
1. At this point, we want to merge groups into cohorts when they occupy _approximately_ the same chunks. For each group `i` we can quickly compute containment against
   all other groups `j` as `C = S.T @ S / number_chunks_per_group`.
-
1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with
   each other. We use _sparsity_ --- the number of non-zero elements in `C` divided by the total number of elements in `C`, that is `C.nnz/C.size`.
   When sparsity is relatively high, we use `"map-reduce"`; otherwise we use `"cohorts"`.
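The containment and sparsity computation in the list above is easy to play with. Below is a small illustrative sketch, not flox's actual implementation, using a dense NumPy array for `S` (flox uses a sparse one) and the monthly/chunk-size-4 layout from earlier.

```python
import numpy as np

# Toy layout: 24 monthly time steps (two "years"), chunked in blocks of 4,
# grouped by month, so there are 6 chunks and 12 labels.
months = np.tile(np.arange(12), 2)      # group label of each time step
chunk_of = np.arange(months.size) // 4  # chunk index of each time step
nchunks, nlabels = chunk_of.max() + 1, 12

# S[i, j] = True when label j is present in chunk i.
S = np.zeros((nchunks, nlabels), dtype=bool)
S[chunk_of, months] = True

# Containment, following the formula above: C = S.T @ S / number_chunks_per_group.
chunks_per_group = S.sum(axis=0)  # number of chunks each label touches
C = (S.astype(int).T @ S.astype(int)) / chunks_per_group

# Sparsity: fraction of non-zero entries (C.nnz / C.size for a sparse C).
sparsity = np.count_nonzero(C) / C.size
print(sparsity)  # 1/3 here: the 12 labels cluster into three cohorts of four
```

In this layout each month shares its chunks completely with the other three months of its cohort and with nothing else, so `C` is block diagonal and sparsity is low, which is exactly the signal that `"cohorts"` is worthwhile.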
@@ -131,6 +133,9 @@ flox will choose:

Cool, isn't it?!

+Importantly, this inference is fast: 400ms for the [US county GroupBy problem in our previous post](https://xarray.dev/blog/flox)!
+But we have not yet tried it on bigger problems (for example, a GroupBy over 100,000 watersheds in the US).
+
## What's next?

flox's ability to do cool inferences relies entirely on the input chunking, which is a major user-tunable knob.
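Since chunking along the grouped dimension is the knob users actually control, here is a hedged sketch of turning it with dask before the groupby; the sizes are hypothetical and the right chunk size depends on your grouping.

```python
import dask.array as da

# Hypothetical: hourly data for one year, initially in awkward 1000-element chunks.
array = da.random.random((8760, 100), chunks=(1000, 100))

# Rechunk the core (grouped) axis so chunk boundaries respect the group
# structure, e.g. whole 24-hour days, before running the grouped reduction.
array = array.rechunk({0: 24 * 7})  # one-week chunks along time
```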
