
Conversation

@ilan-gold (Contributor) commented Jul 3, 2025

See #3723 for a follow-up where we would go algorithm-by-algorithm and implement column-wise operations as well.

  • Release notes not necessary because:

@ilan-gold ilan-gold changed the title (feat): aggregate via dask (feat): sc.get.aggregate via dask Jul 3, 2025
codecov bot commented Jul 3, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@77dd43c). Learn more about missing BASE report.
⚠️ Report is 42 commits behind head on main.

Files with missing lines        Patch %   Lines
src/scanpy/get/_aggregated.py   84.90%    8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3700   +/-   ##
=======================================
  Coverage        ?   76.00%           
=======================================
  Files           ?      117           
  Lines           ?    12600           
  Branches        ?        0           
=======================================
  Hits            ?     9576           
  Misses          ?     3024           
  Partials        ?        0           
Files with missing lines          Coverage Δ
src/scanpy/preprocessing/_qc.py   95.80% <100.00%> (ø)
src/scanpy/get/_aggregated.py     90.04% <84.90%> (ø)

@ilan-gold ilan-gold added this to the 1.12.0 milestone Jul 4, 2025
@ilan-gold (Contributor, Author) commented:

@Intron7 It won't let me request your review, but you're welcome to have a look

@ilan-gold (Contributor, Author) commented:

TODOs:

  • CSC aggregation, i.e., aggregate by groups of features and then concatenate
  • Similar to the above, try out adding (or reusing) some sort of chunks parameter that allows doing the aggregation on a subset of the requested groupby-categories and then concatenating at the end, to potentially save memory (a rough sketch of this idea follows below).
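
As a rough illustration of that second item, a hypothetical helper could look like the following (the function name, the chunk_size parameter, and the copy-based subsetting are all assumptions of this sketch, not an existing scanpy API):

import anndata as ad
import scanpy as sc

def aggregate_in_category_chunks(adata, by, func, chunk_size=2):
    # aggregate a few groupby categories at a time, then stack the results
    cats = adata.obs[by].cat.categories
    pieces = []
    for i in range(0, len(cats), chunk_size):
        subset = adata[adata.obs[by].isin(cats[i : i + chunk_size])].copy()
        # drop the categories outside this chunk so aggregate only emits rows for it
        subset.obs[by] = subset.obs[by].cat.remove_unused_categories()
        pieces.append(sc.get.aggregate(subset, by=by, func=func))
    # concatenate the per-chunk group rows into one result
    return ad.concat(pieces)

In a real implementation the .copy() would be avoided, since materializing each subset works against the memory goal; it just keeps the sketch simple.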

@ilan-gold ilan-gold requested a review from flying-sheep July 21, 2025 09:11
@ilan-gold ilan-gold mentioned this pull request Jul 21, 2025
@ilan-gold ilan-gold marked this pull request as ready for review July 21, 2025 09:24
@flying-sheep flying-sheep requested a review from Intron7 July 21, 2025 12:58
@flying-sheep (Member) commented:

I added @Intron7 to https://github.com/orgs/scverse/teams/scanpy, now you can request his review (I also did so)

@flying-sheep (Member) left a review comment:

Awesome! I’m not 100% sure how it works, though.

Could you explain how it works with respect to using the categories in the chunking? Are you making heterogeneous chunks to collect each subset according to a category?

Comment on lines 401 to 403
dtype=np.float64
if func not in ["count_nonzero", "sum"]
else data.dtype,
@flying-sheep (Member) commented:

Can this info be centralized? This looks fragile because it’s an inline check based on the currently available functions.

@ilan-gold (Contributor, Author) commented:

This really should be a TODO. To be correct, we would need to handle the overflow potential of sum, i.e., jumping from uint16 to uint32 or something, because the summation gets bigger. I wonder how dask or fau handle this internally? For example, https://github.com/scverse/fast-array-utils/blob/b3c7c9829b2fe4e6d23915a2c6bab1b025484b10/src/fast_array_utils/stats/_sum.py#L89-L98 actually seems wrong for this exact reason, no?
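
For concreteness, a small NumPy demonstration of the overflow concern (illustrative only, not code from the PR):

import numpy as np

a = np.full(10, 50_000, dtype=np.uint16)  # each value fits in uint16...
print(a.sum(dtype=np.uint16))  # ...but a uint16 accumulator wraps: 500000 % 65536 == 41248
print(a.sum())  # NumPy's default promotes small ints to a wider accumulator: 500000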

@flying-sheep (Member) commented Jul 24, 2025:

First, I still stand by my comment. Please don’t bi-partition an extensible set by specifying one partition. This should be a lookup, either central or by defining and accessing an inline dict/switch-case that spells out all options.

Regarding the rest: hmm, I guess I conceived fau to be more low-level, i.e., if you fear that something can overflow, set dtype manually; otherwise it preserves dtypes wherever possible. E.g., it doesn’t make sense to preserve int dtypes in mean, but fau doesn’t know how big the numbers you sum up are, so it makes no assumptions.

Does that make sense, or do you think it should be more opinionated?
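
A dict-based lookup in that spirit might look like the following (the helper name is hypothetical, and the mapping simply mirrors the inline check from the snippet above):

import numpy as np

# None means "preserve the input dtype"; every supported aggregation is spelled
# out, so adding a new one forces an explicit dtype decision instead of silently
# landing in a default branch.
_OUT_DTYPES: dict[str, np.dtype | None] = {
    "count_nonzero": None,
    "sum": None,
    "mean": np.dtype(np.float64),
    "var": np.dtype(np.float64),
}

def _result_dtype(func: str, data_dtype: np.dtype) -> np.dtype:
    try:
        out = _OUT_DTYPES[func]
    except KeyError:
        msg = f"unknown aggregation {func!r}"
        raise ValueError(msg) from None
    return data_dtype if out is None else out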

@ilan-gold (Contributor, Author) commented Jul 24, 2025:

Fair point, the two things are separate. I will add a TODO as well, then. Have a look and let me know if this is what you had in mind or if I misunderstood.

@ilan-gold (Contributor, Author) commented:

> Awesome! I’m not 100% sure how it works, though.

The idea is that we go row-chunk by row-chunk and aggregate each chunk individually before doing a sum-like operation over all of them to get the final full-dataset aggregation. For example, for sum with 10 categories and 15 features, aggregating rows 0-100 produces a 10x15 matrix holding that subset's sums, and the same is true of any other subset. If we do this repeatedly over subsets, we can then sum over these 10x15 matrices to get the final aggregation result.

So each map_blocks operation outputs a 1x10x15 array, giving an intermediate n_chunks x 10 x 15 array that is then summed over the first axis (i.e., the .sum(axis=0) call after map_blocks) to give the final result, a 10x15 matrix.

This architecture lets us reuse our already-existing implementation as a subroutine, and hopefully paves the way for applying #3723 here as well, where we would just add a concat operation over the feature space. A toy sketch of the shape bookkeeping follows.
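
A minimal dense-NumPy illustration of that shape bookkeeping (the names, shapes, and the way the group codes are sliced are assumptions of this sketch, not the PR's actual code):

import dask.array as da
import numpy as np

n_rows, n_groups, n_features = 1_000, 10, 15
rng = np.random.default_rng(0)
codes = rng.integers(0, n_groups, size=n_rows)  # group label per row
X = da.random.random((n_rows, n_features), chunks=(100, n_features))

def aggregate_block(block, block_info=None):
    # sum this chunk's rows into per-group buckets -> (1, n_groups, n_features)
    start = block_info[0]["array-location"][0][0]
    block_codes = codes[start : start + block.shape[0]]
    out = np.zeros((n_groups, n_features), dtype=np.float64)
    np.add.at(out, block_codes, block)  # out[code] += row; repeated codes accumulate
    return out[None, ...]

result = X.map_blocks(
    aggregate_block,
    drop_axis=1,  # the feature axis is consumed inside each block...
    new_axis=(1, 2),  # ...and re-emitted along with the new group axis
    chunks=(1, n_groups, n_features),  # each block yields one 1 x 10 x 15 slab
    meta=np.empty((0, 0, 0), dtype=np.float64),
).sum(axis=0)  # n_chunks x 10 x 15 -> final 10 x 15

print(result.compute().shape)  # (10, 15)

A real implementation would use the existing sparse-aware aggregation as the per-block subroutine and pass the codes alongside the data rather than closing over them.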

Does this clarify things?

@ilan-gold ilan-gold requested a review from flying-sheep July 24, 2025 09:58

@ilan-gold ilan-gold requested a review from flying-sheep July 24, 2025 10:15
@flying-sheep (Member) left a review comment:

beautiful!

@flying-sheep flying-sheep merged commit 58240e5 into main Jul 24, 2025
14 checks passed
@flying-sheep flying-sheep deleted the ig/agg_dask branch July 24, 2025 10:36
Nismamjad1 pushed a commit to Nismamjad1/scanpy-yomix that referenced this pull request Oct 3, 2025
Co-authored-by: Philipp A. <flying-sheep@web.de>

Development

Successfully merging this pull request may close this issue: sc.get.aggregate with dask