Description
Can we use Dask to speed up the preprocessing phase of Scanpy by taking advantage of multiple CPUs (or GPUs)?
TL;DR: Dask can speed up Zheng17, but it needs lots of cores and a rewritten implementation. CuPy (for GPUs) is missing some operations required by Zheng17, so more work is needed before Dask can be used with GPUs.
Investigation
Dask is mainly used with dense arrays; however, the arrays in Scanpy are sparse (for most of the preprocessing phase, at least). I tried looking at pydata sparse with Dask, but it ran a lot slower than regular `scipy.sparse` (which is what Scanpy uses).
So I wrote a wrapper around `scipy.sparse` to implement NumPy's `__array_function__` protocol. This allows sparse arrays to be chunks in a Dask array. This approach seemed promising, with basic operations able to take advantage of multiple cores and run faster than regular `scipy.sparse`.
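As a rough sketch of the idea (hypothetical code, not the implementation in the branch linked below; the class name `SparseArray` and the two dispatched functions are illustrative only):

```python
import numpy as np
import scipy.sparse


class SparseArray:
    """Minimal sketch of a scipy.sparse wrapper implementing __array_function__."""

    def __init__(self, value):
        self.value = scipy.sparse.csr_matrix(value)

    # Array-like attributes so the object can stand in for an ndarray chunk
    @property
    def shape(self):
        return self.value.shape

    @property
    def dtype(self):
        return self.value.dtype

    @property
    def ndim(self):
        return self.value.ndim

    def __getitem__(self, item):
        return SparseArray(self.value[item])

    def __array_function__(self, func, types, args, kwargs):
        # Dispatch a couple of NumPy functions to their scipy.sparse
        # equivalents; a real wrapper has to cover far more than this.
        unwrapped = [a.value if isinstance(a, SparseArray) else a for a in args]
        if func is np.sum:
            return np.asarray(unwrapped[0].sum(**kwargs))
        if func is np.mean:
            return np.asarray(unwrapped[0].mean(**kwargs))
        return NotImplemented


# Calls go through NumPy's public API and are dispatched to scipy.sparse,
# keeping the data sparse underneath:
x = SparseArray(scipy.sparse.random(10, 5, density=0.5, format="csr"))
np.sum(x)
np.mean(x, axis=0)
```

A full wrapper also has to handle the functions Dask calls on chunks internally (for example `np.concatenate` during reductions and rechunking), which is where most of the work lies.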
However, when I first tried running the whole Zheng17 recipe, plain `scipy.sparse` was always faster than Dask with `scipy.sparse`, even with many cores (e.g. 64).
It turned out that when using AnnData objects, Dask has to materialize intermediate data more often than necessary in order to populate the AnnData metadata. This is because the way AnnData works means that its metadata must be computed eagerly after each operation in the Zheng17 recipe, rather than lazily for the whole computation (which is the way Dask works).
To avoid this complication I rewrote the Zheng17 recipe to do all the NumPy array computations first and only construct an AnnData representation at the end, taking advantage of Dask's deferred processing of lazy values. (See https://github.com/tomwhite/scanpy/blob/sparse-dask/scanpy/preprocessing/_dask_optimized.py#L115 for the code.)
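As a toy illustration of the eager-vs-lazy difference (using a random Dask array, not the actual recipe or AnnData):

```python
import dask.array as da

# Stand-in for the expression matrix (toy data, not the real recipe).
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

# Eager style: materializing after each step (which per-step AnnData metadata
# effectively forces) turns the intermediate into a plain NumPy array, so the
# remaining steps can no longer be fused into one Dask computation.
centered_eager = (x - x.mean(axis=0)).compute()
scaled_eager = centered_eager / centered_eager.std(axis=0)

# Lazy style: build the whole graph, then compute once at the end, letting
# Dask schedule the entire pipeline across cores in a single pass.
centered = x - x.mean(axis=0)
scaled = centered / centered.std(axis=0)
result = scaled.compute()
```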
With this change, running on the 1M neurons dataset with 64 cores, `scipy.sparse` takes 334s while Dask with `scipy.sparse` takes 138s, a 2.4x speedup.
That's a significant speedup, but I'm not sure that it justifies the code overhead. I'd be interested to hear what others think.
Other notes
Code
See this branch: master...tomwhite:sparse-dask
CuPy and GPUs
I also wrote a wrapper around the GPU equivalent of `scipy.sparse`, `cupyx.scipy.sparse`.
Many operations work; however, `cupyx.scipy.sparse` has a number of missing features that mean it can't be used for Zheng17 yet (the snippet after this list shows the corresponding CPU-side operations). It would require significant work in CuPy to get it working:
- `multiply` - not implemented by `cupyx.scipy.sparse.csr.csr_matrix`
- `mean` - no method on `cupyx.scipy.sparse.csr.csr_matrix` (note that it does have `sum`)
- column subsetting not supported, e.g. `xs[:, 1:3]` (note that row subsetting is)
- boolean indexing, i.e. `xs[:, subset]`, where `subset` is e.g. `np.array([True, False, True, False, True])`; note this fails for rows too
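For reference, these are the corresponding `scipy.sparse` operations that all work on the CPU side (illustrative snippet only, not code from the branch):

```python
import numpy as np
import scipy.sparse

# The kinds of scipy.sparse operations the recipe relies on; a CuPy-backed
# wrapper would need working equivalents of each of these.
xs = scipy.sparse.random(5, 5, density=0.5, format="csr")
subset = np.array([True, False, True, False, True])

xs.multiply(np.arange(5))   # element-wise multiply with broadcasting
xs.mean(axis=0)             # per-column mean (sum alone is not enough)
xs[:, 1:3]                  # column subsetting
xs[1:3, :]                  # row subsetting
xs[:, subset]               # boolean column indexing
xs[subset, :]               # boolean row indexing
```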
NumPy 1.16 vs NumPy 1.17
I used NumPy 1.16 for the above experiments. However, when I tried NumPy 1.17 the Dask implementation slowed down significantly. I haven't been able to pinpoint the issue.
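One difference between the two versions worth noting as background (not offered as an explanation of the slowdown): on NumPy 1.16 the `__array_function__` protocol is experimental and has to be switched on explicitly, whereas on 1.17 it is enabled by default.

```python
# NumPy 1.16: the __array_function__ protocol is opt-in and must be enabled
# before numpy is imported.
import os
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"
import numpy as np

# NumPy 1.17: the protocol is on by default, so no environment variable is needed.
```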