
Conversation

@ilan-gold (Contributor) commented Jul 3, 2025

See #3723 for a follow-up where we would go algorithm-by-algorithm and implement column-wise operations as well.

  • Release notes not necessary because:

@ilan-gold ilan-gold changed the title (feat): aggregate via dask (feat): sc.get.aggregate via dask Jul 3, 2025
codecov bot commented Jul 3, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@77dd43c). Learn more about missing BASE report.
⚠️ Report is 42 commits behind head on main.

Files with missing lines        Patch %   Lines
src/scanpy/get/_aggregated.py   84.90%    8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3700   +/-   ##
=======================================
  Coverage        ?   76.00%           
=======================================
  Files           ?      117           
  Lines           ?    12600           
  Branches        ?        0           
=======================================
  Hits            ?     9576           
  Misses          ?     3024           
  Partials        ?        0           
Files with missing lines          Coverage Δ
src/scanpy/preprocessing/_qc.py   95.80% <100.00%> (ø)
src/scanpy/get/_aggregated.py     90.04% <84.90%> (ø)

@ilan-gold ilan-gold added this to the 1.12.0 milestone Jul 4, 2025
@ilan-gold (Contributor, Author) commented:

@Intron7 It won't let me request your review, but you're welcome to have a look

@ilan-gold (Contributor, Author) commented:

TODOs:

  • CSC aggregation, i.e., aggregate by groups of features and then concatenate
  • Similar to the above, try out adding (or reusing) some sort of chunks parameter that allows doing the aggregation on a subset of the requested groupby-categories and then concatenating at the end, to potentially save memory (a rough sketch of this idea follows below).
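
As a rough illustration of that second item, a hypothetical helper could look like the following (the function name, the chunk_size parameter, and the copy-based subsetting are all assumptions of this sketch, not an existing scanpy API):

import anndata as ad
import scanpy as sc

def aggregate_in_category_chunks(adata, by, func, chunk_size=2):
    # aggregate a few groupby categories at a time, then stack the results
    cats = adata.obs[by].cat.categories
    pieces = []
    for i in range(0, len(cats), chunk_size):
        subset = adata[adata.obs[by].isin(cats[i : i + chunk_size])].copy()
        # drop the categories outside this chunk so aggregate only emits rows for it
        subset.obs[by] = subset.obs[by].cat.remove_unused_categories()
        pieces.append(sc.get.aggregate(subset, by=by, func=func))
    # concatenate the per-chunk group rows into one result
    return ad.concat(pieces)

In a real implementation the .copy() would be avoided, since materializing each subset works against the memory goal; it just keeps the sketch simple.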

@ilan-gold ilan-gold requested a review from flying-sheep July 21, 2025 09:11
@ilan-gold ilan-gold mentioned this pull request Jul 21, 2025
@ilan-gold ilan-gold marked this pull request as ready for review July 21, 2025 09:24
@flying-sheep flying-sheep requested a review from Intron7 July 21, 2025 12:58
@flying-sheep (Member) commented:

I added @Intron7 to https://github.com/orgs/scverse/teams/scanpy, now you can request his review (I also did so)

@flying-sheep (Member) left a review comment:

Awesome! I’m not 100% sure how it works, though.

Could you explain how it works with respect to using the categories in the chunking? Are you making heterogeneous chunks to collect each subset according to a category?

Comment on lines 401 to 403
dtype=np.float64
if func not in ["count_nonzero", "sum"]
else data.dtype,
@flying-sheep (Member) commented:

Can this info be centralized? This looks fragile because it’s an inline check based on the currently available functions.

@ilan-gold (Contributor, Author) commented:

This really should be a TODO. To be correct, we would need to handle the overflow potential of sum, i.e., jumping from uint16 to uint32 or something, because the summation gets bigger. I wonder how dask or fau handle this internally? For example, https://github.com/scverse/fast-array-utils/blob/b3c7c9829b2fe4e6d23915a2c6bab1b025484b10/src/fast_array_utils/stats/_sum.py#L89-L98 actually seems wrong for this exact reason, no?
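
For concreteness, a small NumPy demonstration of the overflow concern (illustrative only, not code from the PR):

import numpy as np

a = np.full(10, 50_000, dtype=np.uint16)  # each value fits in uint16...
print(a.sum(dtype=np.uint16))  # ...but a uint16 accumulator wraps: 500000 % 65536 == 41248
print(a.sum())  # NumPy's default promotes small ints to a wider accumulator: 500000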

@flying-sheep (Member) commented Jul 24, 2025:

First, I still stand by my comment. Please don’t bi-partition an extensible set by specifying one partition. This should be a lookup, either central or by defining and accessing an inline dict/switch-case that spells out all options.

Regarding the rest: hmm, I guess I conceived fau to be more low-level, i.e., if you fear that something can overflow, set dtype manually; otherwise it preserves dtypes wherever possible. E.g., it doesn’t make sense to preserve int dtypes in mean, but fau doesn’t know how big the numbers you sum up are, so it makes no assumptions.

Does that make sense, or do you think it should be more opinionated?
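
A dict-based lookup in that spirit might look like the following (the helper name is hypothetical, and the mapping simply mirrors the inline check from the snippet above):

import numpy as np

# None means "preserve the input dtype"; every supported aggregation is spelled
# out, so adding a new one forces an explicit dtype decision instead of silently
# landing in a default branch.
_OUT_DTYPES: dict[str, np.dtype | None] = {
    "count_nonzero": None,
    "sum": None,
    "mean": np.dtype(np.float64),
    "var": np.dtype(np.float64),
}

def _result_dtype(func: str, data_dtype: np.dtype) -> np.dtype:
    try:
        out = _OUT_DTYPES[func]
    except KeyError:
        msg = f"unknown aggregation {func!r}"
        raise ValueError(msg) from None
    return data_dtype if out is None else out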

@ilan-gold (Contributor, Author) commented Jul 24, 2025:

Fair point, the two things are separate. I will add a TODO as well, then. Have a look and let me know if this is what you had in mind or if I misunderstood.

@ilan-gold (Contributor, Author) commented:

> Awesome! I’m not 100% sure how it works, though.

The idea is that we go row-chunk by row-chunk and aggregate each chunk individually before doing a sum-like operation over all of them to get the final full-dataset aggregation. For example, for sum with 10 categories and 15 features, aggregating rows 0-100 produces a 10x15 matrix holding that subset's sums, and the same is true of any other subset. If we do this repeatedly over subsets, we can then sum over these 10x15 matrices to get the final aggregation result.

So each map_blocks operation outputs a 1x10x15 array, giving an intermediate n_chunks x 10 x 15 array that is then summed over the first axis (i.e., the .sum(axis=0) call after map_blocks) to give the final result, a 10x15 matrix.

This architecture lets us reuse our already-existing implementation as a subroutine, and hopefully paves the way for applying #3723 here as well, where we would just add a concat operation over the feature space. A toy sketch of the shape bookkeeping follows.
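
A minimal dense-NumPy illustration of that shape bookkeeping (the names, shapes, and the way the group codes are sliced are assumptions of this sketch, not the PR's actual code):

import dask.array as da
import numpy as np

n_rows, n_groups, n_features = 1_000, 10, 15
rng = np.random.default_rng(0)
codes = rng.integers(0, n_groups, size=n_rows)  # group label per row
X = da.random.random((n_rows, n_features), chunks=(100, n_features))

def aggregate_block(block, block_info=None):
    # sum this chunk's rows into per-group buckets -> (1, n_groups, n_features)
    start = block_info[0]["array-location"][0][0]
    block_codes = codes[start : start + block.shape[0]]
    out = np.zeros((n_groups, n_features), dtype=np.float64)
    np.add.at(out, block_codes, block)  # out[code] += row; repeated codes accumulate
    return out[None, ...]

result = X.map_blocks(
    aggregate_block,
    drop_axis=1,  # the feature axis is consumed inside each block...
    new_axis=(1, 2),  # ...and re-emitted along with the new group axis
    chunks=(1, n_groups, n_features),  # each block yields one 1 x 10 x 15 slab
    meta=np.empty((0, 0, 0), dtype=np.float64),
).sum(axis=0)  # n_chunks x 10 x 15 -> final 10 x 15

print(result.compute().shape)  # (10, 15)

A real implementation would use the existing sparse-aware aggregation as the per-block subroutine and pass the codes alongside the data rather than closing over them.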

Does this clarify things?

@ilan-gold ilan-gold requested a review from flying-sheep July 24, 2025 09:58

@ilan-gold ilan-gold requested a review from flying-sheep July 24, 2025 10:15
@flying-sheep (Member) left a review comment:

beautiful!

@flying-sheep flying-sheep merged commit 58240e5 into main Jul 24, 2025
14 checks passed
@flying-sheep flying-sheep deleted the ig/agg_dask branch July 24, 2025 10:36
Nismamjad1 pushed a commit to Nismamjad1/scanpy-yomix that referenced this pull request Oct 3, 2025
Co-authored-by: Philipp A. <flying-sheep@web.de>

Development

Successfully merging this pull request may close this issue: sc.get.aggregate with dask