(feat): sc.get.aggregate via dask #3700
Conversation
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
##           main    #3700   +/- ##
=======================================
  Coverage      ?   76.00%
=======================================
  Files         ?      117
  Lines         ?    12600
  Branches      ?        0
=======================================
  Hits          ?     9576
  Misses        ?     3024
  Partials      ?        0
@Intron7 It won't let me request your review, but you're welcome to have a look.

TODOs:

I added @Intron7 to https://github.com/orgs/scverse/teams/scanpy, so now you can request his review (I also did so).
flying-sheep left a comment
Awesome! I’m not 100% sure how it works though.
Could you explain how it works with respect to using the categories in the chunking? Are you making heterogeneous chunks to collect each subset according to a category?
src/scanpy/get/_aggregated.py (outdated)

dtype=np.float64
if func not in ["count_nonzero", "sum"]
else data.dtype,
Can this info be centralized? This looks fragile because it’s an inline check based on the currently available functions.
This really should be a TODO. To be correct, we would need to handle the overflow potential of sum, i.e., jumping from uint16 to uint32 or something, because the summation gets bigger. I wonder how dask or fau handles this internally? For example, https://github.com/scverse/fast-array-utils/blob/b3c7c9829b2fe4e6d23915a2c6bab1b025484b10/src/fast_array_utils/stats/_sum.py#L89-L98 actually seems wrong for this exact reason, no?
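As a minimal sketch of the overflow concern (the helper name and the promotion rule are assumptions for illustration, not fau’s actual behavior):

import numpy as np

x = np.full(100_000, 60_000, dtype=np.uint16)

# Accumulating in the input dtype wraps around: 60_000 * 100_000 >> uint16 max.
print(x.sum(dtype=np.uint16))  # wrapped, meaningless value

# Hypothetical promotion rule: accumulate small int dtypes in a wider type.
def safe_sum_dtype(dtype: np.dtype) -> np.dtype:
    if np.issubdtype(dtype, np.integer):
        return np.promote_types(dtype, np.int64)  # assumption: widen to 64-bit
    return dtype

print(x.sum(dtype=safe_sum_dtype(x.dtype)))  # 6_000_000_000, correct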
First, I still stand by my comment. Please don’t bi-partition an extensible set by specifying one partition. This should be a lookup, either central or by defining and accessing an inline dict/switch-case that spells out all options.
Regarding the rest: hmm, I guess I conceived fau to be more low-level, i.e. if you fear that something can overflow, set dtype manually; otherwise it preserves dtypes wherever possible. E.g. it doesn’t make sense to preserve int dtypes in mean, but fau doesn’t know how big the numbers you sum up are, so it makes no assumptions.
Does that make sense, or do you think it should be more opinionated?
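A sketch of what such a lookup could look like (names and the set of functions are illustrative, not the actual scanpy code):

import numpy as np

# Hypothetical central mapping: which aggregation funcs preserve the input
# dtype and which force a floating-point result. Every supported func is
# spelled out explicitly, so adding a new one forces a conscious decision.
_RESULT_DTYPE: dict[str, str] = {
    "count_nonzero": "preserve",
    "sum": "preserve",
    "mean": "float64",
    "var": "float64",
}

def result_dtype(func: str, data_dtype: np.dtype) -> np.dtype:
    """Look up the output dtype instead of bi-partitioning with an inline check."""
    rule = _RESULT_DTYPE[func]  # KeyError on unknown funcs, by design
    return data_dtype if rule == "preserve" else np.dtype(rule)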
Fair point, the two things are separate. I will add a TODO as well then. Have a look and let me know if this is what you had in mind or if I misunderstood.
Co-authored-by: Philipp A. <flying-sheep@web.de>
The idea is that we go row_chunk-by-row_chunk and aggregate each chunk individually before doing a sum-like operation over all of the per-chunk results to get the final full-dataset aggregation. So each chunk produces a partial per-category aggregation, and those partials are then combined into the final result. This architecture allows us to reuse our already existing implementation as a subroutine, and hopefully paves the way for applying #3723 here as well. Does this clarify things?
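Roughly, the pattern is the following (a hedged sketch with made-up helper names and dense arrays, not the actual implementation):

import dask
import numpy as np

def aggregate_chunk(chunk: np.ndarray, codes: np.ndarray, n_groups: int) -> np.ndarray:
    """Per-chunk subroutine: scatter-add each row of `chunk` into its category's row."""
    out = np.zeros((n_groups, chunk.shape[1]))
    np.add.at(out, codes, chunk)  # unbuffered add of chunk[i] onto out[codes[i]]
    return out

rng = np.random.default_rng(0)
X = rng.random((100, 5))               # (n_obs, n_vars)
codes = rng.integers(0, 3, size=100)   # category code per row
n_groups = 3

# Aggregate row chunk by row chunk, each yielding an (n_groups, n_vars) partial...
partials = [
    dask.delayed(aggregate_chunk)(X[i : i + 25], codes[i : i + 25], n_groups)
    for i in range(0, 100, 25)
]
# ...then a sum-like reduction over the partials gives the full-dataset result.
result = dask.delayed(sum)(partials).compute()
assert np.allclose(result, aggregate_chunk(X, codes, n_groups))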
flying-sheep left a comment
beautiful!
Co-authored-by: Philipp A. <flying-sheep@web.de>
See #3723 for a follow-up where we would go algorithm-by-algorithm and implement column-wise operations as well.
sc.get.aggregate with dask #3659