[Fail case] Almost-blockwise weighted arithmetic vorticity calculation #1

Open
@TomNicholas

Description

Motivation

@rabernat and I have a huge oceanographic dataset (LLC4320) with surface velocity components u and v. We want to calculate the vorticity from these as u*dx - v*dy, which weights by the grid cell sizes dx and dy in the x and y directions.

This is a specific example of a generic pattern in geoscience / fluid dynamics: apply a simple arithmetic operation weighted by some lower-dimensional constants. In our case we are approximating a differential operator, but the same pattern shows up elsewhere, for example in weighted averages.
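A minimal sketch of the pattern using dask.array directly (the shapes, chunk sizes and random data below are purely illustrative, not the actual dataset):

```python
import dask.array as da

# Illustrative stand-ins (not the real LLC4320 shapes or chunking):
# u and v are large chunked 2D velocity fields, dx and dy are much
# smaller grid metrics that broadcast against them.
ny, nx = 4320, 4320
u = da.random.random((ny, nx), chunks=(1080, 1080))
v = da.random.random((ny, nx), chunks=(1080, 1080))
dx = da.random.random((1, nx), chunks=(1, 1080))
dy = da.random.random((ny, 1), chunks=(1080, 1))

# The weighted arithmetic operation: purely elementwise, so in principle
# it should map onto a fully blockwise (embarrassingly parallel) graph.
vorticity = u * dx - v * dy
```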

Full problem statement

The full problem we were trying to solve has an additional feature: a rechunking step in which each variable gets one more element appended to the end of the array and is then immediately rechunked to merge that length-1 chunk back into the final chunk of the array. This is done by xGCM to deal with boundary conditions, but the only part relevant to this issue is that it is practically a no-op, yet is enough that the overall graph is technically no longer "blockwise".
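A rough sketch of that pad-and-rechunk step in isolation (assumed structure, not xGCM's actual code); the result is practically identical to the input, but the graph now contains concatenate and rechunk layers rather than being purely blockwise:

```python
import dask.array as da

u = da.random.random((4320, 4320), chunks=(1080, 1080))

# Append one "boundary" element along x (here just a copy of the last column)
u_padded = da.concatenate([u, u[:, -1:]], axis=1)
print(u_padded.chunks[1])    # (1080, 1080, 1080, 1080, 1)

# Immediately merge the stray length-1 chunk back into the final chunk
merged = u_padded.chunks[1][:-2] + (u_padded.chunks[1][-2] + 1,)
u_fixed = u_padded.rechunk({1: merged})
print(u_fixed.chunks[1])     # (1080, 1080, 1080, 1081)
```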

What specifically goes wrong

This computation displays a number of failure modes when executed with dask.distributed.

  • Too many tasks - Doing anything with this LLC4320 dataset immediately involves hundreds of thousands of tasks, simply because even a single variable has 117390 chunks, so every elementwise operation adds roughly one task per chunk (see the sketch after this list). This is not going to scale much further.

  • Memory blowup - This is the big one: the loss of blockwise structure causes distributed to load input chunks in a non-streaming fashion, so data accumulates in memory faster than results are written out, blowing up memory usage overall. Gabe Joseph refers to this problem as "root task overproduction" and has implemented some potential fixes to dask.distributed. However, it is also arguably a symptom of a more fundamental problem: dask does not know enough about what happens inside each task to make any guarantees about memory usage.

  • Widely-shared dependencies - dx * u is an extreme one-to-many dependency pattern, where every chunk of u needs access to dx. At large scale this can become a scheduling bottleneck for distributed (in a way I don't fully understand).

  • Root-task co-location - For an in-memory executor like dask, corresponding chunks of u and v should be co-located on the same processor, to improve performance and avoid a huge amount of unnecessary communication.
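To make the task-count point concrete, here is a quick way to count the tasks dask generates (the chunk counts below are far smaller than the 117390 chunks per LLC4320 variable, but the scaling is the same: every elementwise operation adds roughly one task per chunk):

```python
import dask.array as da

u = da.random.random((4320, 4320), chunks=(135, 135))   # 32 x 32 = 1024 chunks
v = da.random.random((4320, 4320), chunks=(135, 135))

result = u - v
print(result.numblocks)                     # (32, 32) output chunks
print(len(dict(result.__dask_graph__())))   # low-level tasks: O(number of chunks)
```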

General considerations

I think this problem is a useful benchmark because it exposes multiple failure modes of the dask.distributed scheduler at once, whilst also allowing you to switch any of these problems on or off with slight changes to the problem definition.

  • Starting from (u - u): this would still be called via apply_ufunc in xarray, and produces an O(N)-size task graph in dask, where N is the number of chunks. (It would also be nice if a high-level graph abstraction knew that this was a no-op...)

  • (u - v) introduces the problem of "root task co-location", where you want the scheduler to know that corresponding chunks of u and v should be co-located on the same node from the start in order to avoid a lot of unnecessary communication.

  • (u*dx - v*dy) now has the "widely-shared dependencies" problem, where dx and dy are much smaller than u and v, but are required by all chunks of u and v.

  • The same again but with the rechunk step inserted as described above: this is now essentially the same graph and should be evaluated in the same way, but differs just enough to no longer technically be blockwise (see the sketch below). This non-blockwiseness is what causes distributed to fall back on the flawed algorithm that displays the "root task overproduction" problem.
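A sketch of those four cases written as plain dask.array expressions (shapes and chunks are illustrative again; the real computation goes through xarray.apply_ufunc on the LLC4320 variables):

```python
import dask.array as da

u = da.random.random((4320, 4320), chunks=(1080, 1080))
v = da.random.random((4320, 4320), chunks=(1080, 1080))
dx = da.random.random((1, 4320), chunks=(1, 1080))
dy = da.random.random((4320, 1), chunks=(1080, 1))

case1 = u - u               # logically a no-op, but still an O(N)-task graph
case2 = u - v               # adds the root-task co-location requirement
case3 = u * dx - v * dy     # adds the widely-shared dependencies dx and dy

# Case 4: the same expression after the pad-and-rechunk step from the problem
# statement (applied here to everything spanning the x axis so shapes match).
def pad_and_rechunk(arr):
    padded = da.concatenate([arr, arr[:, -1:]], axis=1)
    merged = padded.chunks[1][:-2] + (padded.chunks[1][-2] + 1,)
    return padded.rechunk({1: merged})

case4 = pad_and_rechunk(u) * pad_and_rechunk(dx) - pad_and_rechunk(v) * dy

# On dask versions exposing HighLevelGraph layers, you can see which layers are
# still recognised as blockwise by inspecting their types:
print({name: type(layer).__name__ for name, layer in case4.dask.layers.items()})
```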

Links

Benchmarks

  • dask
    • Full problem on original dask.distributed performs very poorly
    • Might be a lot better after Gabe's fixes but I haven't tried this yet
    • For the SciPy talk we actually hacked together a workaround that does the computation in an embarrassingly-parallel fashion using dask.delayed (link). This involved pre-loading dx and dy into memory (sketched after this list).
  • cubed
  • xarray-beam
    • Haven't tried yet
    • Don't think it can be expressed in xarray-beam easily
    • But the delayed workaround is very much like a beam calculation
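A rough sketch of the shape of that delayed workaround (the structure is assumed here, not the actual SciPy-talk code): dx and dy are held in memory, and each block of u and v is processed completely independently, which is also roughly how the same computation would look as a beam pipeline.

```python
import numpy as np
import dask

# Illustrative in-memory stand-ins; in the real workaround each task would read
# its u and v blocks from storage and write its result to an output store.
ny, nx, block = 432, 432, 108
u = np.random.random((ny, nx))
v = np.random.random((ny, nx))
dx = np.random.random((1, nx))   # small weights, pre-loaded into memory
dy = np.random.random((ny, 1))

@dask.delayed
def vorticity_block(j0, j1, i0, i1):
    # Each task only touches its own block of u and v plus the small weights,
    # so there is no inter-task communication at all.
    return u[j0:j1, i0:i1] * dx[:, i0:i1] - v[j0:j1, i0:i1] * dy[j0:j1, :]

tasks = [vorticity_block(j, j + block, i, i + block)
         for j in range(0, ny, block)
         for i in range(0, nx, block)]
results = dask.compute(*tasks)   # one ndarray per block
```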
