
Investigate Dask for speeding up Zheng17 #921

@tomwhite

Description


Can we use Dask to speed up the preprocessing phase of Scanpy by taking advantage of multiple CPUs (or GPUs)?

TL;DR: Dask can speed up Zheng17, but it needs many cores and a rewritten implementation. CuPy (for GPUs) is missing operations that Zheng17 requires, so more work is needed before Dask can be used with GPUs.

Investigation

Dask is mainly used with dense arrays; however, the arrays in Scanpy are sparse (for most of the preprocessing phase, at least). I tried pydata/sparse with Dask, but it ran a lot slower than regular scipy.sparse (which is what Scanpy uses).
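For reference, the pydata/sparse experiment looked roughly like this (a minimal sketch rather than the actual benchmark code; the matrix size, chunk size, and the reduction are illustrative, and the real runs used the 1M neurons dataset):

```python
import dask.array as da
import numpy as np
import scipy.sparse
import sparse  # pydata/sparse

# Illustrative random count matrix (cells x genes).
X = scipy.sparse.random(10_000, 1_000, density=0.01, format="csr")

# Dask array whose chunks are pydata/sparse COO arrays.
dX = da.from_array(sparse.COO.from_scipy_sparse(X), chunks=(1_000, 1_000))

# The same reduction with plain scipy.sparse, for comparison.
dask_result = dX.sum(axis=0).compute()            # a sparse.COO result
scipy_result = np.asarray(X.sum(axis=0)).ravel()
assert np.allclose(dask_result.todense(), scipy_result)
```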

So I wrote a wrapper around scipy.sparse that implements NumPy's __array_function__ protocol. This allows sparse arrays to be chunks in a Dask array. The approach seemed promising, with basic operations able to take advantage of multiple cores and run faster than regular scipy.sparse.
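To give a flavour of how this works, here is a heavily reduced sketch (not the code on the branch; the SparseArray class and the implements helper are made-up names for illustration):

```python
import numpy as np
import scipy.sparse

# On NumPy 1.16 the protocol is behind NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1;
# it is enabled by default from NumPy 1.17 onwards.
HANDLED_FUNCTIONS = {}

def implements(np_function):
    """Register an __array_function__ override for a NumPy function."""
    def decorator(func):
        HANDLED_FUNCTIONS[np_function] = func
        return func
    return decorator

class SparseArray:
    """Thin wrapper around a scipy.sparse matrix that speaks __array_function__,
    so it can act as a chunk in a Dask array."""

    def __init__(self, value):
        self.value = scipy.sparse.csr_matrix(value)
        self.shape = self.value.shape
        self.ndim = self.value.ndim
        self.dtype = self.value.dtype

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)

@implements(np.sum)
def _sum(a, axis=None, **kwargs):
    result = a.value.sum(axis=axis)
    # Along an axis, scipy.sparse returns an np.matrix; flatten to a 1-d ndarray.
    return np.asarray(result).ravel() if axis is not None else result

@implements(np.mean)
def _mean(a, axis=None, **kwargs):
    result = a.value.mean(axis=axis)
    return np.asarray(result).ravel() if axis is not None else result
```

The real wrapper needs to cover more of the protocol than this (indexing, concatenation, and so on) before Dask can slice it into chunks and run whole pipelines over it.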

However, when I first tried running the whole Zheng17 recipe, plain scipy.sparse was always faster than Dask with scipy.sparse chunks, even with many cores (e.g. 64).

It turned out that, when working with AnnData arrays, Dask has to materialize intermediate data more often than necessary in order to populate the AnnData metadata. This is because the way AnnData works means that its metadata must be computed eagerly after each operation in the Zheng17 recipe, rather than lazily for the whole computation (which is the way Dask works).

To avoid this complication I rewrote the Zheng17 recipe to do all the NumPy array computations first and only construct an AnnData representation at the end, so as to take advantage of Dask's deferred processing of lazy values. (See https://github.com/tomwhite/scanpy/blob/sparse-dask/scanpy/preprocessing/_dask_optimized.py#L115 for the code.)
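The shape of the rewrite is roughly as follows (a simplified sketch with made-up intermediate steps, not the _dask_optimized.py code; the point is that every step only extends the Dask graph, and compute is called once at the end, just before the AnnData object is built):

```python
import dask
import dask.array as da
import numpy as np
from anndata import AnnData

def recipe_zheng17_deferred(dX):
    """dX is a lazy Dask array of counts (cells x genes)."""
    # Every step here just extends the Dask graph; nothing is computed yet.
    counts_per_cell = dX.sum(axis=1)
    dX = dX / counts_per_cell[:, None]   # normalize counts per cell
    mean = dX.mean(axis=0)
    var = dX.var(axis=0)
    dispersion = var / mean              # crude dispersion measure, illustrative only
    # ... gene selection, log1p, scaling would go here, still lazily ...
    dX = da.log1p(dX)

    # Force the whole computation once, then build the AnnData object at the end,
    # instead of materializing intermediate results after every recipe step.
    X, mean_, disp_ = dask.compute(dX, mean, dispersion)
    adata = AnnData(X)
    adata.var["mean"] = np.asarray(mean_).ravel()
    adata.var["dispersion"] = np.asarray(disp_).ravel()
    return adata
```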

With this change, running on the 1M neurons dataset with 64 cores, scipy.sparse takes 334s, while Dask with scipy.sparse takes 138s, a 2.4x speedup.

That's a significant speedup, but I'm not sure that it justifies the code overhead. I'd be interested to hear what others think.

Other notes

Code

See this branch: master...tomwhite:sparse-dask

CuPy and GPUs

I also wrote a wrapper around the GPU equivalent of scipy.sparse, cupyx.scipy.sparse.

Many operations work; however, cupyx.scipy.sparse is missing a number of features, which means it can't be used for Zheng17 yet. It would require significant work in CuPy to get it working (a quick probe of these gaps is sketched after the list):

  • multiply - not implemented by cupyx.scipy.sparse.csr.csr_matrix
  • mean - no method on cupyx.scipy.sparse.csr.csr_matrix (note that it does have sum)
  • column subset not supported, e.g. xs[:, 1:3] (note that row subset is)
  • boolean indexing, i.e. xs[:, subset], where subset is e.g. np.array([True, False, True, False, True]); note this fails for rows too
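For reference, the gaps can be reproduced with something like the following (requires a GPU and CuPy; newer CuPy releases may have filled some of these in):

```python
import cupy as cp
import cupyx.scipy.sparse as cusparse

xs = cusparse.csr_matrix(cp.array([[1.0, 0, 2], [0, 3, 0], [4, 0, 5]]))

# multiply and mean were missing at the time of writing, whereas sum was present.
print("multiply:", hasattr(xs, "multiply"))
print("mean:", hasattr(xs, "mean"))
print("sum:", hasattr(xs, "sum"))

# Column slicing and boolean indexing raised errors at the time of writing.
for label, op in [
    ("column slice", lambda: xs[:, 1:3]),
    ("boolean index", lambda: xs[:, cp.array([True, False, True])]),
]:
    try:
        op()
        print(label, "ok")
    except Exception as exc:
        print(label, "failed:", type(exc).__name__)
```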

NumPy 1.16 vs NumPy 1.17

I used NumPy 1.16 for the above experiments. However, when I tried NumPy 1.17, the Dask implementation slowed down significantly. I haven't been able to pinpoint the issue.
