
Commit 1cc7eb5 (parent 5a2bb65): edits

1 file changed: src/posts/flox-smart/index.md (+18 -22 lines)
@@ -26,7 +26,7 @@ With flox installed, Xarray instead uses its parallel-friendly tree reduction.

In many cases, this is a massive improvement.
Notice how much cleaner the graph is in this image:

![map-reduce](https://flox.readthedocs.io/en/latest/_images/new-map-reduce-reindex-True-annotated.svg)

See our [previous blog post](https://xarray.dev/blog/flox) or the [docs](https://flox.readthedocs.io/en/latest/implementation.html#method-map-reduce) for more.
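For instance, with flox installed a chunked GroupBy reduction picks up the new code path with no change to user code. A minimal sketch (the tutorial dataset and chunk size here are my own choices, not from the post):

```python
import xarray as xr

# Xarray dispatches this groupby-mean to flox when flox is importable,
# and flox picks a graph-friendly execution strategy.
ds = xr.tutorial.open_dataset("air_temperature").chunk({"time": 100})
monthly = ds.groupby("time.month").mean()  # lazy dask graph
monthly.compute()                          # runs the tree reduction
```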
## Tree reductions can be catastrophically bad

@@ -45,7 +45,7 @@ Thus `flox` quickly grew two new modes of computing the groupby reduction.

Two key realizations influenced that development:

1. Array workloads frequently group by a relatively small in-memory array. Quite frequently those arrays have patterns to their values: `"time.month"` is exactly periodic, `"time.dayofyear"` is approximately periodic (depending on the calendar), and `"time.year"` is commonly a monotonically increasing array (see the short sketch after this list).
2. Chunk sizes (or "partition sizes") for arrays can be quite small along the core dimension of an operation, so a grouped reduction applied blockwise need not reduce the data by much (or at all!). This is an important difference between arrays and dataframes!

These two properties are particularly relevant for "climatology" calculations (e.g. `groupby("time.month").mean()`) — a common Xarray workload in the Earth Sciences.
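As a tiny illustration of the first realization, consider a hypothetical two-year monthly series (my own example, not from the post):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2001-01-01", periods=24, freq="MS")  # 24 monthly steps
da = xr.DataArray(np.arange(24.0), dims="time", coords={"time": time})

# "time.month" is exactly periodic; "time.year" is monotonically increasing.
print(da["time.month"].values)  # [ 1  2 ... 12  1  2 ... 12]
print(da["time.year"].values)   # [2001 2001 ... 2001 2002 ... 2002]
```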

@@ -84,24 +84,24 @@ For groups `A,B,C,D` that occupy the following chunks (chunk 0 is the first chunk):

```
A: [0, 1, 2]
B: [1, 2, 3, 4]
C: [5, 6, 7, 8]
D: [8]
X: [0, 4]
```

We want to detect the cohorts `{A, B, X}` and `{C, D}` with the following chunks:

```
[A, B, X]: [0, 1, 2, 3, 4]
[C, D]: [5, 6, 7, 8]
```

Importantly, we do _not_ want to depend on detecting exact patterns, and prefer approximate solutions and heuristics.
## The solution

After a fun exploration involving ideas such as [locality-sensitive hashing](http://ekzhu.com/datasketch/lshensemble.html) and [all-pairs set similarity search](https://www.cse.unsw.edu.au/~lxue/WWW08.pdf), I settled on the following algorithm.

I use set _containment_, or a "normalized intersection", to determine the similarity between the sets of chunks occupied by two different groups (`Q` and `X`): $C(Q, X) = \frac{|Q \cap X|}{|Q|}$.
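A minimal sketch of that metric in plain Python (not flox's implementation), reusing the groups from the example above:

```python
def containment(q: set[int], x: set[int]) -> float:
    """|Q ∩ X| / |Q|: the fraction of Q's chunks that X also occupies."""
    return len(q & x) / len(q)

A, B, X = {0, 1, 2}, {1, 2, 3, 4}, {0, 4}
C_group = {5, 6, 7, 8}            # `C` the group, not the matrix below
print(containment(A, B))          # 0.67: A's chunks mostly overlap B's
print(containment(A, C_group))    # 0.0:  no overlap, so different cohorts
```

Note that containment is asymmetric: `containment(X, A)` is 0.5 while `containment(A, X)` is only 0.33, which is exactly what makes it useful for deciding whether a small group "belongs" with a bigger one.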

@@ -121,11 +121,11 @@ The steps are as follows:

1. Now we can quickly determine a number of special cases and exit early:
   1. Use `"blockwise"` when every group is confined to a single block.
   1. Use `"cohorts"` when
      1. every chunk only has a single group, but that group might extend across multiple chunks; and
      1. existing cohorts don't overlap at all.
   1. [and more](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L408-L426)

If we haven't exited yet, then we want to merge together any detected cohorts that substantially overlap with each other using the containment metric.

1. For that we first quickly compute containment for all groups `i` against all other groups `j` as `C = S.T @ S / number_chunks_per_group` (see the sketch after this list).
1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with
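Here is a rough reconstruction of that matrix computation for the running example (my own sketch of the formula above, not flox's actual code):

```python
import numpy as np

chunks_per_group = {"A": [0, 1, 2], "B": [1, 2, 3, 4], "C": [5, 6, 7, 8], "D": [8], "X": [0, 4]}
nchunks = 9

# S[j, i] = 1 when chunk j contains group i
S = np.zeros((nchunks, len(chunks_per_group)))
for i, chunks in enumerate(chunks_per_group.values()):
    S[chunks, i] = 1

number_chunks_per_group = S.sum(axis=0)
C = (S.T @ S) / number_chunks_per_group[:, None]  # C[i, j] = |i ∩ j| / |i|
print(np.round(C, 2))
```

Rows with large off-diagonal entries mark groups whose chunks substantially overlap; those are the candidates for merging into one cohort.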
@@ -136,25 +136,21 @@ If we reach here, then we want to merge together any detected cohorts that substantially overlap with each other.

For more detail [see the docs](https://flox.readthedocs.io/en/latest/implementation.html#heuristics) or [the code](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L336).
Suggestions and improvements are very welcome!

Here is containment `C[i, j]` for a range of chunk sizes from 1 to 12 for computing `groupby("time.month")` of a monthly mean dataset.
The images only show 12 time steps.
These are colored so that light yellow is $C=0$, and dark purple is $C=1$.
`C[i,j] = 1` when the chunks occupied by group `i` perfectly overlap with those occupied by group `j` (so the diagonal elements are always 1).
The title on each image is `(chunk size, sparsity)`.
When the chunksize _is_ a divisor of the period 12, $C$ is a [block diagonal](https://en.wikipedia.org/wiki/Block_matrix) matrix.
When the chunksize _is not_ a divisor of the period 12, $C$ is much less sparse in comparison.

![flox sparsity image](https://flox.readthedocs.io/en/latest/_images/containment.png)
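Matrices like these can be reproduced roughly as follows (a synthetic construction of mine, using a uniformly chunked multi-year monthly axis rather than the exact data behind the figure):

```python
import numpy as np

def monthly_containment(nsteps: int, chunksize: int) -> np.ndarray:
    """Containment between the 12 month labels on a uniformly chunked time axis."""
    month = np.arange(nsteps) % 12            # group label per time step
    chunk = np.arange(nsteps) // chunksize    # chunk index per time step
    S = np.zeros((chunk[-1] + 1, 12))
    S[chunk, month] = 1                       # chunk-by-group occupancy
    return (S.T @ S) / S.sum(axis=0)[:, None]

C4 = monthly_containment(120, chunksize=4)  # 4 divides 12: block diagonal, sparse
C5 = monthly_containment(120, chunksize=5)  # 5 does not: much denser
print((C4 > 0).mean(), (C5 > 0).mean())     # fraction of nonzero entries
```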

Given the above `C`, flox will choose `"cohorts"` for chunk sizes (1, 2, 3, 4, 6, 12), and `"map-reduce"` for the rest.
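In practice you never compute `C` yourself: the heuristic runs when no method is specified, and `flox.xarray.xarray_reduce` also lets you force a strategy. A hypothetical usage sketch, assuming a chunked dataset `ds` with a `time` dimension:

```python
import flox.xarray

month = ds.time.dt.month  # the (small, in-memory) group labels

# Let flox pick the strategy via the containment heuristic...
auto = flox.xarray.xarray_reduce(ds, month, func="mean")
# ...or override it with one of the strategies discussed above.
forced = flox.xarray.xarray_reduce(ds, month, func="mean", method="cohorts")
```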

Cool, isn't it?!

Importantly, this inference is fast — [400ms for the US county](https://flox.readthedocs.io/en/latest/implementation.html#example-spatial-grouping) GroupBy problem in our [previous post](https://xarray.dev/blog/flox), where the group labels are a 2GB 2D array!
But we have not tried it with bigger problems (e.g. a GroupBy over 100,000 watersheds in the US).

## What's next?
