Description
Can we use Dask to speed up the preprocessing phase of Scanpy by taking advantage of multiple CPUs (or GPUs)?
TL;DR: Dask can speed up Zheng17, but it needs lots of cores and a rewritten implementation. CuPy (for GPUs) is missing some operations required by Zheng17, so more work is needed before Dask can be used with GPUs.
Investigation
Dask is mainly used with dense arrays; however, the arrays in Scanpy are sparse (for most of the preprocessing phase, at least). I tried looking at pydata sparse with Dask, but it ran a lot slower than regular `scipy.sparse` (which is what Scanpy uses).
So I wrote a wrapper around `scipy.sparse` to implement NumPy's `__array_function__` protocol. This allows sparse arrays to be chunks in a Dask array. This approach seemed promising, with basic operations able to take advantage of multiple cores and run faster than regular `scipy.sparse`.
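As a rough sketch of the idea (hypothetical code, not the implementation in the branch linked below; the class name `SparseArray` and the two dispatched functions are illustrative only):

```python
import numpy as np
import scipy.sparse


class SparseArray:
    """Minimal sketch of a scipy.sparse wrapper implementing __array_function__."""

    def __init__(self, value):
        self.value = scipy.sparse.csr_matrix(value)

    # Array-like attributes so the object can stand in for an ndarray chunk
    @property
    def shape(self):
        return self.value.shape

    @property
    def dtype(self):
        return self.value.dtype

    @property
    def ndim(self):
        return self.value.ndim

    def __getitem__(self, item):
        return SparseArray(self.value[item])

    def __array_function__(self, func, types, args, kwargs):
        # Dispatch a couple of NumPy functions to their scipy.sparse
        # equivalents; a real wrapper has to cover far more than this.
        unwrapped = [a.value if isinstance(a, SparseArray) else a for a in args]
        if func is np.sum:
            return np.asarray(unwrapped[0].sum(**kwargs))
        if func is np.mean:
            return np.asarray(unwrapped[0].mean(**kwargs))
        return NotImplemented


# Calls go through NumPy's public API and are dispatched to scipy.sparse,
# keeping the data sparse underneath:
x = SparseArray(scipy.sparse.random(10, 5, density=0.5, format="csr"))
np.sum(x)
np.mean(x, axis=0)
```

A full wrapper also has to handle the functions Dask calls on chunks internally (for example `np.concatenate` during reductions and rechunking), which is where most of the work lies.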
However, when I first tried running the whole Zheng17 recipe, plain `scipy.sparse` was always faster than Dask with `scipy.sparse`, even with many cores (e.g. 64).
It turned out that when using AnnData objects, Dask has to materialize intermediate data more often than necessary in order to populate the AnnData metadata. This is because the way AnnData works means that its metadata must be computed eagerly after each operation in the Zheng17 recipe, rather than lazily for the whole computation (which is the way Dask works).
To avoid this complication I rewrote the Zheng17 recipe to do all the NumPy array computations first and only construct an AnnData representation at the end, taking advantage of Dask's deferred processing of lazy values. (See https://github.com/tomwhite/scanpy/blob/sparse-dask/scanpy/preprocessing/_dask_optimized.py#L115 for the code.)
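As a toy illustration of the eager-vs-lazy difference (using a random Dask array, not the actual recipe or AnnData):

```python
import dask.array as da

# Stand-in for the expression matrix (toy data, not the real recipe).
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

# Eager style: materializing after each step (which per-step AnnData metadata
# effectively forces) turns the intermediate into a plain NumPy array, so the
# remaining steps can no longer be fused into one Dask computation.
centered_eager = (x - x.mean(axis=0)).compute()
scaled_eager = centered_eager / centered_eager.std(axis=0)

# Lazy style: build the whole graph, then compute once at the end, letting
# Dask schedule the entire pipeline across cores in a single pass.
centered = x - x.mean(axis=0)
scaled = centered / centered.std(axis=0)
result = scaled.compute()
```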
With this change, running on the 1M neurons dataset with 64 cores, `scipy.sparse` takes 334s while Dask with `scipy.sparse` takes 138s, a 2.4x speedup.
That's a significant speedup, but I'm not sure that it justifies the code overhead. I'd be interested to hear what others think.
Other notes
Code
See this branch: master...tomwhite:sparse-dask
CuPy and GPUs
I also wrote a wrapper around the GPU equivalent of `scipy.sparse`, `cupyx.scipy.sparse`.
Many operations work; however, `cupyx.scipy.sparse` has a number of missing features that mean it can't be used for Zheng17 yet (the snippet after this list shows the corresponding CPU-side operations). It would require significant work in CuPy to get it working:
- `multiply` - not implemented by `cupyx.scipy.sparse.csr.csr_matrix`
- `mean` - no method on `cupyx.scipy.sparse.csr.csr_matrix` (note that it does have `sum`)
- column subsetting not supported, e.g. `xs[:, 1:3]` (note that row subsetting is)
- boolean indexing, i.e. `xs[:, subset]`, where `subset` is e.g. `np.array([True, False, True, False, True])`; note this fails for rows too
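For reference, these are the corresponding `scipy.sparse` operations that all work on the CPU side (illustrative snippet only, not code from the branch):

```python
import numpy as np
import scipy.sparse

# The kinds of scipy.sparse operations the recipe relies on; a CuPy-backed
# wrapper would need working equivalents of each of these.
xs = scipy.sparse.random(5, 5, density=0.5, format="csr")
subset = np.array([True, False, True, False, True])

xs.multiply(np.arange(5))   # element-wise multiply with broadcasting
xs.mean(axis=0)             # per-column mean (sum alone is not enough)
xs[:, 1:3]                  # column subsetting
xs[1:3, :]                  # row subsetting
xs[:, subset]               # boolean column indexing
xs[subset, :]               # boolean row indexing
```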
NumPy 1.16 vs NumPy 1.17
I used NumPy 1.16 for the above experiments. However, when I tried NumPy 1.17 the Dask implementation slowed down significantly. I haven't been able to pinpoint the issue.
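One difference between the two versions worth noting as background (not offered as an explanation of the slowdown): on NumPy 1.16 the `__array_function__` protocol is experimental and has to be switched on explicitly, whereas on 1.17 it is enabled by default.

```python
# NumPy 1.16: the __array_function__ protocol is opt-in and must be enabled
# before numpy is imported.
import os
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"
import numpy as np

# NumPy 1.17: the protocol is on by default, so no environment variable is needed.
```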