[python] Fast CSR loading for `to_anndata` #83

Merged

bkmartinjr merged 14 commits into main from bkmartinjr/fast_to_anndata

Jan 19, 2023

Member

bkmartinjr commented Jan 18, 2023 •

edited by johnkerl

This PR addresses the performance issues noted in single-cell-data/TileDB-SOMA#719. In particular, loading a large AnnData requires CSR sparse matrices, and converting from COO to CSR is very slow. This PR adds a fast-path converter for Python, implementing two changes:

parallelized COO to CSR conversion
eager reading of the COO data from SOMA, allowing some concurrent COO-to-CSR work to overlap with the data reading.
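
The eager-reading change can be illustrated with a small prefetching iterator. This is only a sketch of the general idea, not the code in `eager_iter.py`: the class name `PrefetchIterator`, its constructor arguments, and the `slow_chunks` stand-in are all hypothetical. It keeps exactly one read in flight on a worker thread, so the consumer can process chunk N while chunk N+1 is being fetched.

```python
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking exhaustion of the source


class PrefetchIterator:
    """Hypothetical helper: always keeps one read in flight on a worker thread."""

    def __init__(self, source, pool):
        self._it = iter(source)
        self._pool = pool
        self._future = pool.submit(self._fetch)  # start the first read immediately

    def _fetch(self):
        return next(self._it, _DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._future.result()  # blocks only if the read has not finished
        if item is _DONE:
            raise StopIteration
        self._future = self._pool.submit(self._fetch)  # begin the next read now
        return item


def slow_chunks():
    # Stand-in for chunked reads from storage (assumption for illustration).
    for i in range(3):
        yield [i, i + 1]


with ThreadPoolExecutor(max_workers=1) as pool:
    # The consumer's work on one chunk overlaps with the fetch of the next.
    total = sum(sum(chunk) for chunk in PrefetchIterator(slow_chunks(), pool))
```

Only one fetch is outstanding at a time, so the source iterator is never advanced concurrently. An exception raised by the source surfaces in the consumer when it calls `result()`.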

In addition, this change uses substantially less memory for the conversion, allowing larger datasets to be loaded into CSR, and therefore into AnnData.
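
For readers unfamiliar with the conversion itself, a chunked COO-to-CSR conversion can be sketched in plain NumPy as two passes over the chunks: count nonzeros per row to build `indptr`, then scatter each chunk into output arrays that are allocated once. This is a serial illustration under assumptions, not the PR's implementation: the function name `coo_chunks_to_csr` and the `(rows, cols, values)` chunk layout are hypothetical, and the parallelism described in this PR is omitted.

```python
import numpy as np


def coo_chunks_to_csr(chunks, n_rows):
    """Hypothetical sketch: build CSR (data, indices, indptr) from COO chunks."""
    # Pass 1: count nonzeros per row across all chunks.
    counts = np.zeros(n_rows, dtype=np.int64)
    for rows, _cols, _vals in chunks:
        counts += np.bincount(rows, minlength=n_rows)

    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])
    nnz = int(indptr[-1])

    # Output arrays are allocated once, sized by the total nonzero count.
    indices = np.empty((nnz,), dtype=np.int64)
    data = np.empty((nnz,), dtype=chunks[0][2].dtype)

    # Pass 2: scatter each chunk into its rows' slots.
    next_slot = indptr[:-1].copy()  # next free position for each row
    for rows, cols, vals in chunks:
        order = np.argsort(rows, kind="stable")
        r = rows[order]
        chunk_counts = np.bincount(r, minlength=n_rows)
        starts = np.cumsum(chunk_counts) - chunk_counts  # row offsets in sorted chunk
        within = np.arange(r.size) - starts[r]  # rank of each entry within its row
        dest = next_slot[r] + within
        indices[dest] = cols[order]
        data[dest] = vals[order]
        next_slot += chunk_counts
    return data, indices, indptr


chunks = [
    (np.array([2, 0, 2]), np.array([1, 0, 0]), np.array([5.0, 1.0, 7.0])),
    (np.array([0, 1]), np.array([2, 1]), np.array([3.0, 4.0])),
]
data, indices, indptr = coo_chunks_to_csr(chunks, n_rows=3)
```

Column indices within a row are left in arrival order rather than sorted. If SciPy is available, the three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=...)`.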

Before/after benchmarks show 1.5-4X speed-ups when the work fits in RAM (tested on r6i instance types with data on S3). In cases where paging occurred, speed-ups were dramatically larger (e.g., 20X).

See also the related PR single-cell-data/TileDB-SOMA#745.

bkmartinjr added 8 commits

January 18, 2023 17:44

fast CSR implementation

a42cdb6

Merge branch 'main' into bkmartinjr/fast_to_anndata

36823ca

lint

4a54f48

add numba to deps

da4e1e8

add missing numba dependency

ae47897

add numba to list of ignored modules for mypy

aa864cc

do not subscript type signatures

54c68d2

merge with main

8f5bacf

bkmartinjr marked this pull request as ready for review

January 18, 2023 18:39

bkmartinjr mentioned this pull request

[python] Prep for ExperimentAxisQuery perf work in somacore single-cell-data/TileDB-SOMA#745

Merged

bkmartinjr requested review from thetorpedodog, johnkerl and gspowley

January 18, 2023 18:41

thetorpedodog reviewed

Contributor

thetorpedodog left a comment

I have some style suggestions and related concerns here, but the overall structure of the code looks pretty good. No major changes, just some polishing.

python-spec/src/somacore/query/eager_iter.py Outdated

python-spec/src/somacore/query/fast_csr.py Outdated

bkmartinjr added 2 commits

January 18, 2023 20:11

PR feedback

31bd218

more PR changes

79dd025

Member Author

bkmartinjr commented Jan 18, 2023

@thetorpedodog - OK, I think I hit all of your (excellent) feedback.

bkmartinjr requested a review from thetorpedodog

January 18, 2023 20:16

thetorpedodog approved these changes

Contributor

thetorpedodog left a comment

a few minor, non-essential suggestions, which you are free to take or leave. looks good!

python-spec/src/somacore/query/eager_iter.py Outdated

python-spec/src/somacore/query/fast_csr.py Outdated

python-spec/src/somacore/query/fast_csr.py

python-spec/src/somacore/query/query.py Outdated

bkmartinjr added 2 commits

January 18, 2023 20:55

more PR review changes

b7976a0

pr feedback

2905edc

Contributor

thetorpedodog commented Jan 18, 2023

Still looks good.

johnkerl mentioned this pull request

[c++/python] Optimize CSR I/O single-cell-data/TileDB-SOMA#718

Closed

johnkerl requested changes

pyproject.toml

python-spec/requirements-py3.10.txt

python-spec/src/somacore/query/fast_csr.py Outdated

python-spec/src/somacore/query/eager_iter.py

python-spec/src/somacore/query/fast_csr.py


more changes for PR review

883210e

bkmartinjr requested a review from johnkerl

January 19, 2023 01:48

johnkerl approved these changes

Member

johnkerl left a comment

🚢 🚢 🚢

python-spec/src/somacore/query/fast_csr.py Outdated

+                      indices = np.empty((nnz,), dtype=index_dtype)
+                      data = np.empty((nnz,), dtype=self.coo_chunks[0][2].dtype)
+                      # empirically determined value. Needs to be large enough for reasonable

Member

johnkerl Jan 19, 2023

Super-nit: capitalize Empirically at start of sentence

Member Author

bkmartinjr Jan 19, 2023

super nit, but super easy - fixed!


typo

2ce3f5b

bkmartinjr merged commit 2704e1a into main

bkmartinjr deleted the bkmartinjr/fast_to_anndata branch

January 19, 2023 01:58

bkmartinjr mentioned this pull request

[python] experiment.axis_query().to_anndata() is slow (CSR opportunity) single-cell-data/TileDB-SOMA#719

Closed

Member Author

bkmartinjr commented Jan 19, 2023

Related issue with scipy.sparse performance: scipy/scipy#11496

johnkerl changed the title ~~[python] Fast CSR loading for to_anndata~~ [python] Fast CSR loading for `to_anndata`
