Second, there is `method="cohorts"`, which is a bit more subtle.
Consider `groupby("time.month")` for the monthly mean dataset, i.e. grouping by an exactly periodic array.
When the chunk size along the core dimension "time" is a divisor of the period (so either 1, 2, 3, 4, or 6 in this case), groups tend to occur in cohorts ("groups of groups").
For example, with a chunk size of 4, monthly mean input data for Jan, Feb, Mar, and April ("one cohort") are _always_ in the same chunk, and totally separate from any of the other months.
This means that we can run the tree reduction for each cohort (three cohorts in total: `JFMA | MJJA | SOND`) independently and expose more parallelism.
Doing so can significantly reduce compute time and, in particular, the memory required for the computation.
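
To make the chunk/month alignment concrete, here is a tiny NumPy sketch (the two-year monthly axis is invented for illustration) that prints which month labels land in each chunk when the chunksize is 4:

```python
import numpy as np

# Two years of monthly labels: 1..12, 1..12 (illustrative only).
months = np.tile(np.arange(1, 13), 2)

# Split into chunks of size 4 along "time".
chunks = np.split(months, len(months) // 4)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: months {sorted(set(chunk.tolist()))}")
# Chunks hold months {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, then the pattern
# repeats, so the JFMA, MJJA, and SOND cohorts never share a chunk.
```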
Importantly, if there isn't much separation of groups into cohorts (for example, if the groups are randomly distributed across chunks), then we'd like to use the standard `method="map-reduce"` for low overhead.
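
Both strategies can be requested explicitly. Here is a minimal sketch using `flox.xarray.xarray_reduce`, assuming flox, xarray, dask, and pandas are installed; the toy dataset is invented and chunked so that chunks line up with the cohorts:

```python
import numpy as np
import pandas as pd
import xarray as xr
import flox.xarray

# Hypothetical two-year monthly dataset, chunked with chunksize 4 along "time"
# so chunks line up with the JFMA | MJJA | SOND cohorts.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
ds = xr.Dataset(
    {"air": ("time", np.random.rand(24))}, coords={"time": time}
).chunk({"time": 4})

# One tree reduction per cohort of months.
clim_cohorts = flox.xarray.xarray_reduce(
    ds, ds.time.dt.month, func="mean", method="cohorts"
)

# The standard low-overhead strategy, a better fit when groups are scattered
# randomly across chunks.
clim_map_reduce = flox.xarray.xarray_reduce(
    ds, ds.time.dt.month, func="mean", method="map-reduce"
)
```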
## Choosing a strategy is hard, and harder to teach.
These strategies are great, but the downside is that some sophistication is required to apply them.
Worse, they are hard to explain conceptually! I've tried! ([example 1](https://discourse.pangeo.io/t/optimizing-climatology-calculation-with-xarray-and-dask/2453/20?u=dcherian), [example 2](https://discourse.pangeo.io/t/understanding-optimal-zarr-chunking-scheme-for-a-climatology/2335)).
The steps are as follows:
1. First determine which labels are present in each chunk. The distribution of labels across chunks
   is represented internally as a 2D boolean sparse array `S[chunks, labels]`. `S[i, j] = 1` when
   label `j` is present in chunk `i`.
1. Now we can quickly determine a number of special cases:
   1. Use `"blockwise"` when every group is contained in one block each.
   1. Use `"cohorts"` when every chunk only has a single group, but that group might extend across multiple chunks.
1. At this point, we want to merge groups into cohorts when they occupy _approximately_ the same chunks. For each group `i` we can quickly compute containment against
   all other groups `j` as `C = S.T @ S / number_chunks_per_group`.
1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with
   each other. We use _sparsity_ --- the number of non-zero elements in `C` divided by the number of elements in `C`, `C.nnz/C.size`. When sparsity is relatively high, we use `"map-reduce"`, otherwise we use `"cohorts"` (sketched in code below).
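
Here is a minimal dense sketch of that bookkeeping (not flox's actual implementation, which keeps `S` and the containment matrix sparse); the toy chunk layout is invented:

```python
import numpy as np

# S[i, j] = 1 when label j is present in chunk i.
# Toy layout: 3 chunks, 4 labels; label 3 spills across chunks 1 and 2.
S = np.array(
    [
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
    ]
)

# Containment, C = S.T @ S / number_chunks_per_group, reading the division as
# a broadcast over columns (divide column j by the number of chunks group j
# occupies).
number_chunks_per_group = S.sum(axis=0)
C = (S.T @ S) / number_chunks_per_group

# Sparsity: non-zero entries of C divided by the total number of entries.
sparsity = np.count_nonzero(C) / C.size
print(f"sparsity = {sparsity:.2f}")  # 0.50 here; high favors "map-reduce", low favors "cohorts"
```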
Cool, isn't it?!
Importantly, this inference is fast: 400ms for the [US county GroupBy problem in our previous post](https://xarray.dev/blog/flox)!
But we have not tried it with bigger problems (for example, a GroupBy over 100,000 watersheds in the US).
## What's next?
flox's ability to make these cool inferences relies entirely on the input chunking, which is a major user-tunable knob.