See our [previous blog post](https://xarray.dev/blog/flox) or the [docs](https://flox.readthedocs.io/en/latest/implementation.html#method-map-reduce) for more.
## Tree reductions can be catastrophically bad
Thus `flox` quickly grew two new modes of computing the groupby reduction.
Two key realizations influenced that development:
1. Array workloads frequently group by a relatively small in-memory array, and those arrays often have patterns to their values: `"time.month"` is exactly periodic, `"time.dayofyear"` is approximately periodic (depending on the calendar), and `"time.year"` is commonly a monotonically increasing array.
2. Chunk sizes (or "partition sizes") for arrays can be quite small along the core dimension of an operation, so a grouped reduction applied blockwise may not reduce the data by much (or at all!). This is an important difference between arrays and dataframes!
These two properties are particularly relevant for "climatology" calculations (e.g. `groupby("time.month").mean()`) — a common Xarray workload in the Earth Sciences.
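As a concrete illustration, here is a minimal sketch of such a climatology (assuming `dask` is installed; the sizes and variable names are illustrative, not from a real workload):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three years of monthly data, chunked 4 time steps at a time, so each
# chunk sees only a few of the 12 possible "time.month" labels.
time = pd.date_range("2000-01-01", periods=36, freq="MS")
da = xr.DataArray(
    np.random.rand(36), dims="time", coords={"time": time}
).chunk({"time": 4})

# The classic climatology: group by the exactly periodic "time.month"
monthly_climatology = da.groupby("time.month").mean()
```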
For groups `A,B,C,D,X` that occupy the following chunks (chunk 0 is the first chunk):
```
A: [0, 1, 2]
B: [1, 2, 3, 4]
C: [5, 6, 7, 8]
D: [8]
X: [0, 4]
```
We want to detect the cohorts `{A, B, X}` and `{C, D}` with the following chunks.
```
[A, B, X]: [0, 1, 2, 3, 4]
[C, D]: [5, 6, 7, 8]
```
Importantly, we do _not_ want to depend on detecting exact patterns, and prefer approximate solutions and heuristics.
## The solution
After a fun exploration of ideas like [locality-sensitive hashing](http://ekzhu.com/datasketch/lshensemble.html) and [all-pairs set similarity search](https://www.cse.unsw.edu.au/~lxue/WWW08.pdf), I settled on the following algorithm.
I use set _containment_, or a "normalized intersection", to determine the similarity between the sets of chunks occupied by two different groups (`Q` and `X`): $C(Q, X) = \frac{|Q \cap X|}{|Q|}$.
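For intuition, here is a toy version of containment using plain Python sets (not flox's internal representation), with the example groups from above:

```python
# Chunks occupied by each group in the example above
occupied = {
    "A": {0, 1, 2},
    "B": {1, 2, 3, 4},
    "C": {5, 6, 7, 8},
    "D": {8},
    "X": {0, 4},
}

def containment(Q: set, X: set) -> float:
    # C(Q, X) = |Q & X| / |Q|: the fraction of Q's chunks also occupied by X
    return len(Q & X) / len(Q)

containment(occupied["D"], occupied["C"])  # 1.0: D occurs only where C does
containment(occupied["A"], occupied["C"])  # 0.0: A and C never share a chunk
```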
The steps are as follows:
1. Now we can quickly determine a number of special cases and exit early:
   1. Use `"blockwise"` when each group is confined to a single block.
   1. Use `"cohorts"` when
      1. every chunk only has a single group, but that group might extend across multiple chunks; and
If we haven't exited yet, then we want to merge together any detected cohorts that substantially overlap with each other, using the containment metric.
1. For that we first quickly compute containment for all groups `i` against all other groups `j` as `C = S.T @ S / number_chunks_per_group` (see the sketch after this list).
1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with each other.
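To make these steps concrete, here is a minimal dense-NumPy sketch for the toy example above; flox's actual implementation (linked below) uses sparse arrays, and the names here are illustrative:

```python
import numpy as np

labels = ["A", "B", "C", "D", "X"]
occupied = {"A": [0, 1, 2], "B": [1, 2, 3, 4], "C": [5, 6, 7, 8], "D": [8], "X": [0, 4]}

# Bitmask S[chunk, group]: S[i, j] = 1 when group j occurs in chunk i
S = np.zeros((9, len(labels)), dtype=np.uint8)
for j, lab in enumerate(labels):
    S[occupied[lab], j] = 1

# Containment C[i, j] = |chunks(i) & chunks(j)| / |chunks(i)|,
# i.e. C = S.T @ S / number_chunks_per_group
number_chunks_per_group = S.sum(axis=0)
C = (S.T @ S) / number_chunks_per_group[:, np.newaxis]

# A summary measure of overlap: the fraction of nonzero entries of C
sparsity = np.count_nonzero(C) / C.size
```

With a suitable threshold on `C`, merging groups connected by high containment recovers the cohorts `{A, B, X}` and `{C, D}` for this example.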
For more detail [see the docs](https://flox.readthedocs.io/en/latest/implementation.html#heuristics) or [the code](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L336).
Suggestions and improvements are very welcome!
Here is containment `C[i, j]` for a range of chunk sizes from 1 to 12 for computing `groupby("time.month")` of a monthly mean dataset.
The images only show 12 time steps.
These are colored so that light yellow is $C=0$, and dark purple is $C=1$.
`C[i,j] = 1` when the chunks occupied by group `i` perfectly overlap with those occupied by group `j` (so the diagonal elements are always 1).
The title on each image is `(chunk size, sparsity)`.
When the chunk size _is_ a divisor of the period 12, $C$ is a [block diagonal](https://en.wikipedia.org/wiki/Block_matrix) matrix.
When the chunk size _is not_ a divisor of the period 12, $C$ is much less sparse in comparison.
Given the above `C`, flox will choose `"cohorts"` for chunk sizes (1, 2, 3, 4, 6, 12), and `"map-reduce"` for the rest.
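You can reproduce the flavor of this experiment with the dense construction from the sketch above (again purely illustrative; 24 time steps are used so that month labels repeat across chunks):

```python
import numpy as np

def monthly_containment(nsteps: int, chunksize: int) -> np.ndarray:
    """Containment C for groupby("time.month") labels split into equal chunks."""
    labels = np.arange(nsteps) % 12      # month label of each time step
    nchunks = -(-nsteps // chunksize)    # ceil division
    S = np.zeros((nchunks, 12), dtype=np.uint8)
    S[np.arange(nsteps) // chunksize, labels] = 1
    return (S.T @ S) / S.sum(axis=0)[:, np.newaxis]

# Chunk size 4 divides the period 12: C is block diagonal and very sparse.
# Chunk size 5 does not: C has many more nonzero entries.
for size in (4, 5):
    C = monthly_containment(24, size)
    print(size, np.count_nonzero(C) / C.size)
```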
Cool, isn't it?!
Importantly, this inference is fast — [400ms for the US county](https://flox.readthedocs.io/en/latest/implementation.html#example-spatial-grouping) GroupBy problem in our [previous post](https://xarray.dev/blog/flox), where the group labels are a 2GB 2D array!
But we have not yet tried bigger problems (for example, a GroupBy over 100,000 watersheds in the US).