[python] Fast CSR loading for to_anndata
#83
I have some style suggestions and related concerns here, but the overall structure of the code looks pretty good. No major changes, just some polishing.
@thetorpedodog - OK, I think I hit all of your (excellent) feedback.
a few minor, non-essential suggestions, which you are free to take or leave. looks good!
Still looks good.
🚢 🚢 🚢
indices = np.empty((nnz,), dtype=index_dtype)
data = np.empty((nnz,), dtype=self.coo_chunks[0][2].dtype)

# empirically determined value. Needs to be large enough for reasonable
Super-nit: capitalize Empirically at start of sentence
super nit, but super easy - fixed!
Related issue with scipy.sparse performance: scipy/scipy#11496
to_anndata
This PR addresses the performance issues noted in single-cell-data/TileDB-SOMA#719. In particular, loading a large AnnData requires CSR sparse matrices, and converting from COO to CSR is very slow. This PR adds a fast-path converter for Python, implementing two changes:
In addition, this change uses substantially less memory for the conversion, allowing larger datasets to be loaded into CSR, and therefore into AnnData.
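For readers unfamiliar with the conversion being discussed, here is a minimal sketch of assembling a CSR matrix from COO chunks with output arrays that are allocated once. It is an illustration under stated assumptions, not this PR's converter: the function name `coo_chunks_to_csr` is hypothetical, and the `(row, col, data)` chunk layout is inferred only from the `coo_chunks[0][2]` access visible in the diff.

```python
import numpy as np
import scipy.sparse as sp


def coo_chunks_to_csr(coo_chunks, shape):
    """Hypothetical helper: build a CSR matrix from (row, col, data) COO chunks.

    Assumes at least one chunk, non-negative integer row/col arrays,
    and indices that lie within `shape`.
    """
    n_rows, n_cols = shape
    nnz = sum(len(d) for _, _, d in coo_chunks)

    # Pass 1: count nonzeros per row, then prefix-sum into the row pointer.
    counts = np.zeros(n_rows, dtype=np.int64)
    for row, _, _ in coo_chunks:
        counts += np.bincount(row, minlength=n_rows)
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    indptr[1:] = np.cumsum(counts)

    # Allocate the output arrays once, using the narrowest safe index type.
    small = max(n_cols, nnz) < np.iinfo(np.int32).max
    index_dtype = np.int32 if small else np.int64
    indices = np.empty((nnz,), dtype=index_dtype)
    data = np.empty((nnz,), dtype=coo_chunks[0][2].dtype)

    # Pass 2: scatter each chunk into the slots reserved for its rows.
    next_slot = indptr[:-1].copy()
    for row, col, val in coo_chunks:
        order = np.argsort(row, kind="stable")
        r = row[order]
        uniq, start, cnt = np.unique(r, return_index=True, return_counts=True)
        # Position of each element within its row group in this chunk.
        within = np.arange(len(r)) - np.repeat(start, cnt)
        dest = next_slot[r] + within
        indices[dest] = col[order]
        data[dest] = val[order]
        next_slot[uniq] += cnt

    # Column indices within a row are left unsorted; duplicates are not merged.
    return sp.csr_matrix((data, indices, indptr.astype(index_dtype)), shape=shape)
```

The sketch never concatenates the chunks into one large COO matrix, so the only full-size allocations are the final `indices` and `data` arrays. Call `sort_indices()` or `sum_duplicates()` on the result if downstream code needs canonical CSR.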
Before/after benchmarks show 1.5-4X speed-ups when the work fits in RAM (tested on r6i instance types with data on S3). In cases where paging occurred, speed-ups were dramatically larger (e.g., 20X).
See also the related PR single-cell-data/TileDB-SOMA#745.