[python] Fast CSR loading for to_anndata #83

Merged: 14 commits, Jan 19, 2023


@bkmartinjr (Member) commented Jan 18, 2023

This PR addresses the performance issues noted in single-cell-data/TileDB-SOMA#719. In particular, loading a large AnnData requires CSR sparse matrices, and converting COO to CSR is very slow. This PR adds a fast-path converter for Python, implementing two changes:

  • parallelized COO to CSR conversion
  • eager reading of the COO data from SOMA, allowing some concurrent COO-to-CSR work to overlap with the data reading.
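
The eager-reading idea in the second bullet can be sketched as an iterator wrapper that fetches the next item on a worker thread while the caller is still processing the current one. This is only an illustration of the general technique, not the code in `eager_iter.py`; the class name `PrefetchIterator` and its one-item look-ahead are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


class PrefetchIterator(Iterator[T]):
    """Hypothetical helper: reads one item ahead on a worker thread."""

    def __init__(self, source: Iterator[T], pool: ThreadPoolExecutor):
        self._source = source
        self._pool = pool
        # Start fetching the first item immediately.
        self._future = pool.submit(self._fetch)

    def _fetch(self) -> Tuple[bool, Optional[T]]:
        # Return (True, item), or (False, None) once the source is exhausted.
        try:
            return True, next(self._source)
        except StopIteration:
            return False, None

    def __iter__(self) -> "PrefetchIterator[T]":
        return self

    def __next__(self) -> T:
        has_item, item = self._future.result()
        if not has_item:
            raise StopIteration
        # Kick off the next read before handing the current item back,
        # so the read overlaps with whatever the caller does with `item`.
        self._future = self._pool.submit(self._fetch)
        return item


# Usage: the consumer loop runs while the next item is fetched in the background.
with ThreadPoolExecutor(max_workers=1) as pool:
    chunks = list(PrefetchIterator(iter([1, 2, 3]), pool))
```

Only one fetch is ever in flight, so the source iterator is never advanced from two threads at once.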

In addition, this change uses substantially less memory for the conversion, allowing larger datasets to be loaded into CSR, and therefore into AnnData.
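
As a rough illustration of how a chunked COO-to-CSR conversion can write into output arrays that are allocated once, rather than concatenating every chunk first, here is a minimal single-threaded sketch. It is not the implementation in `fast_csr.py`: the function name `coo_chunks_to_csr`, the `(rows, cols, values)` chunk layout, and the two-pass counting approach are assumptions for the example, and the parallelism described in this PR is left out.

```python
from typing import List, Tuple

import numpy as np

CooChunk = Tuple[np.ndarray, np.ndarray, np.ndarray]  # (rows, cols, values)


def coo_chunks_to_csr(chunks: List[CooChunk], n_rows: int):
    """Hypothetical sketch: merge COO chunks into CSR arrays.

    Assumes `chunks` is non-empty and contains no duplicate coordinates.
    """
    # Pass 1: count entries per row to build the CSR row pointer.
    counts = np.zeros(n_rows, dtype=np.int64)
    for rows, _, _ in chunks:
        counts += np.bincount(rows, minlength=n_rows)
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])
    nnz = int(indptr[-1])

    # Allocate the output arrays once, at their final size.
    indices = np.empty((nnz,), dtype=np.int64)
    data = np.empty((nnz,), dtype=chunks[0][2].dtype)

    # Pass 2: scatter each chunk into its rows' slots.
    next_slot = indptr[:-1].copy()  # next free position for each row
    for rows, cols, vals in chunks:
        order = np.argsort(rows, kind="stable")
        rows_sorted = rows[order]
        # Offset of each entry within its run of equal row values.
        run_start = np.searchsorted(rows_sorted, rows_sorted, side="left")
        dest = next_slot[rows_sorted] + (np.arange(len(rows_sorted)) - run_start)
        indices[dest] = cols[order]
        data[dest] = vals[order]
        next_slot += np.bincount(rows, minlength=n_rows)

    return data, indices, indptr


# Usage: two small chunks of a 3 x 4 matrix.
chunks = [
    (np.array([2, 0]), np.array([1, 3]), np.array([5.0, 6.0])),
    (np.array([0, 1]), np.array([0, 2]), np.array([7.0, 8.0])),
]
data, indices, indptr = coo_chunks_to_csr(chunks, n_rows=3)
```

The three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 4))`. Column indices within a row are left unsorted in this sketch, so calling `sort_indices()` on the result may be needed.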

Before/after benchmarks show 1.5-4X speed-ups when the work fits in RAM (tested on r6i instance types with data on S3). In cases where paging occurred, speed-ups were dramatically larger (e.g., 20X).

See also, related PR single-cell-data/TileDB-SOMA#745

@thetorpedodog (Contributor) left a comment

I have some style suggestions and related concerns here, but the overall structure of the code looks pretty good. No major changes, just some polishing.

Resolved review threads (collapsed): 1 on python-spec/src/somacore/query/eager_iter.py, 9 on python-spec/src/somacore/query/fast_csr.py.

@bkmartinjr (Member, Author) commented

@thetorpedodog - OK, I think I hit all of your (excellent) feedback.

@thetorpedodog (Contributor) left a comment

a few minor, non-essential suggestions, which you are free to take or leave. looks good!

Resolved review threads (collapsed): 1 on python-spec/src/somacore/query/eager_iter.py, 5 on python-spec/src/somacore/query/fast_csr.py, 1 on python-spec/src/somacore/query/query.py.

@thetorpedodog (Contributor) commented

Still looks good.

Resolved review threads (collapsed): 1 on pyproject.toml, 1 on python-spec/requirements-py3.10.txt, 1 on python-spec/src/somacore/query/eager_iter.py, 7 on python-spec/src/somacore/query/fast_csr.py.

@johnkerl (Member) left a comment

🚢 🚢 🚢

indices = np.empty((nnz,), dtype=index_dtype)
data = np.empty((nnz,), dtype=self.coo_chunks[0][2].dtype)

# empirically determined value. Needs to be large enough for reasonable
Member


Super-nit: capitalize Empirically at start of sentence

Member Author


super nit, but super easy - fixed!

@bkmartinjr bkmartinjr merged commit 2704e1a into main Jan 19, 2023
@bkmartinjr bkmartinjr deleted the bkmartinjr/fast_to_anndata branch January 19, 2023 01:58
@bkmartinjr (Member, Author) commented

Related issue with scipy.sparse performance: scipy/scipy#11496

@johnkerl johnkerl changed the title [python] Fast CSR loading for to_anndata [python] Fast CSR loading for to_anndata Mar 1, 2023