(feat): `read_lazy` for whole `AnnData` lazy-loading + `xarray` reading + `read_elem_as_dask` -> `read_elem_lazy` #1247

ilan-gold · 2023-11-30T12:51:54Z

This PR is a lighter-weight version of #947 that uses the original `AnnData` object as the class holding `obs` and `var`, each as an `xr.Dataset`.

Closes #951 (Dask and Zarr not loading obsp and obsm from remote s3) and closes #981 (lazy dataframes in .obs and .var with backed="r" mode).

- Tests added
- Release note added (or unnecessary)

codecov · 2023-12-07T16:24:55Z

Codecov Report

Attention: Patch coverage is 95.32294% with 21 lines in your changes missing coverage. Please review.

Project coverage is 85.20%. Comparing base (0024b82) to head (1643da6).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/anndata/_core/storage.py | 37.50% | 5 Missing ⚠️ |
| src/anndata/experimental/backed/_lazy_arrays.py | 95.45% | 4 Missing ⚠️ |
| src/anndata/experimental/backed/_compat.py | 84.21% | 3 Missing ⚠️ |
| src/anndata/experimental/backed/_xarray.py | 95.52% | 3 Missing ⚠️ |
| src/anndata/tests/helpers.py | 85.00% | 3 Missing ⚠️ |
| src/anndata/_io/specs/registry.py | 88.23% | 2 Missing ⚠️ |
| src/anndata/_io/specs/lazy_methods.py | 98.59% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1247      +/-   ##
==========================================
+ Coverage   84.51%   85.20%   +0.69%     
==========================================
  Files          40       45       +5     
  Lines        6050     6457     +407     
==========================================
+ Hits         5113     5502     +389     
- Misses        937      955      +18

| Files with missing lines | Coverage Δ |
|---|---|
| src/anndata/_core/aligned_df.py | `97.87% <100.00%> (+0.09%)` ⬆️ |
| src/anndata/_core/anndata.py | `83.77% <100.00%> (+0.04%)` ⬆️ |
| src/anndata/_core/index.py | `95.00% <100.00%> (+0.03%)` ⬆️ |
| src/anndata/_core/merge.py | `86.00% <100.00%> (+2.09%)` ⬆️ |
| src/anndata/_core/views.py | `85.71% <100.00%> (+0.38%)` ⬆️ |
| src/anndata/_io/specs/__init__.py | `100.00% <ø> (ø)` |
| src/anndata/_io/zarr.py | `83.75% <100.00%> (+0.20%)` ⬆️ |
| src/anndata/_types.py | `86.11% <100.00%> (+0.81%)` ⬆️ |
| src/anndata/experimental/__init__.py | `100.00% <100.00%> (ø)` |
| src/anndata/experimental/backed/__init__.py | `100.00% <100.00%> (ø)` |
... and 9 more

... and 1 file with indirect coverage changes

flying-sheep

Please make a PR into the notebook repo, there are some things broken in the notebook that I’d like to properly review.

Also I’m still not a fan of the exports from private namespaces, but I’m OK with it if there are at least minimal docstrings for them.

This is very close! Thanks for taking on this huge thing!

flying-sheep · 2024-10-22T09:10:13Z

docs/api.md

+   experimental.backed._lazy_arrays.MaskedArray
+   experimental.backed._lazy_arrays.CategoricalArray
+   experimental.backed._xarray.Dataset2D


Yeah, I think they only work for param docs. I should get around to checking whether I can extend it to work for pure Sphinx as well.

flying-sheep · 2024-10-22T09:34:33Z

src/anndata/_core/index.py

-        assert index.dtype != int, msg
+    from ..experimental.backed._compat import xr
+
+    # TODO: why is this here? All tests pass without it and it seems at the minimum not strict enough.


asserts in runtime code are purely there to make debugging easier in case the asserts fail.

Hmm, so it’s not possible that _normalize_index ever gets called with the wrong type of index as result of user action?

If that’s impossible, feel free to remove. Otherwise we should probably update this check to a TypeError or so.
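The distinction raised here can be sketched as follows. Python strips `assert` statements when run with `-O`, so a check that guards against user input is usually written as an explicit `raise`. The helper below is a minimal, hypothetical illustration: `check_indexer_type`, the set of allowed types, and the error message are assumptions made for this example and are not the actual `_normalize_index` code.

```python
def check_indexer_type(indexer: object) -> None:
    """Hypothetical validation step; not anndata's real implementation."""
    # Assumed set of accepted indexer types, chosen only for illustration.
    allowed = (int, str, slice, list, tuple)
    if not isinstance(indexer, allowed):
        # An explicit raise still runs under `python -O`, unlike `assert`.
        msg = f"Unsupported indexer type: {type(indexer).__name__}"
        raise TypeError(msg)


def is_rejected(indexer: object) -> bool:
    """Return True if the check above raises TypeError for `indexer`."""
    try:
        check_indexer_type(indexer)
    except TypeError:
        return True
    return False
```

If the wrong type can only arise from an internal bug, keeping an `assert` (or dropping the check) is reasonable; if a user can trigger it, the explicit `TypeError` gives them an actionable message.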

flying-sheep · 2024-10-22T09:37:45Z

src/anndata/_core/merge.py

+        dtype = "object"
+    else:
+        dtype = col.dtype.numpy_dtype
+    # TODO: get good chunk size?


small reminder in case you have an idea for this TODO

flying-sheep · 2024-10-22T11:14:50Z

tests/test_read_lazy.py

+
+
+# remote has object dtype, need to convert back for integers booleans etc.
+def correct_extension_dtype_differences(remote: pd.DataFrame, memory: pd.DataFrame):


Why does remote have object dtypes? Can that be fixed ahead of time? Will that affect users, or is that just here in the tests?

Also, for clarity, please call it `fix_extension_dtype_differences` or `unify_extension_dtypes` and make the comment a docstring. I was unsure whether "correct" was meant as a verb or as in something like `make_correct_...`.

I will change the name and add a better comment, since it's only used for concat. There is no way of concatenating the lazy pandas adapter arrays we have, so in order to work with them, we present a dask array wrapping each one when they are concatenated. Practically, this doesn't change anything in terms of IO, but it means that when we make the round trip, we lose the original data type. We could maybe store the original data type in uns or something, but:

- People concatenating remote/large datasets probably won't be reading them into memory very often.
- This gets complicated quickly with mixed data types. If you have two similarly named columns with different numerical data types, we need to start dealing with upcasting/downcasting.

Feels like a cross-that-bridge-when-we-come-to-it issue.
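To make the dtype loss concrete, here is a minimal sketch of what such a test helper could look like. The name `unify_extension_dtypes` comes from the review suggestion above, but the body is an assumption written for illustration and is not the code in this PR; the round trip is simulated with `astype(object)` rather than an actual lazy concat.

```python
import pandas as pd


def unify_extension_dtypes(
    remote: pd.DataFrame, memory: pd.DataFrame
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Cast object-dtype columns of `remote` back to the dtype in `memory`.

    Illustrative only: touches columns present in both frames that are
    stored as `object` in `remote`.
    """
    remote = remote.copy()
    for col in memory.columns:
        if col in remote.columns and remote[col].dtype == object:
            remote[col] = remote[col].astype(memory[col].dtype)
    return remote, memory


memory = pd.DataFrame(
    {
        "counts": pd.array([1, 2, None], dtype="Int64"),
        "flag": pd.array([True, False, None], dtype="boolean"),
    }
)
# Simulate the round trip: nullable extension dtypes degrade to object.
remote = memory.astype(object)
fixed, _ = unify_extension_dtypes(remote, memory)
```

Using the in-memory frame as the source of truth sidesteps the upcasting question, which is why a helper like this fits a test but not a general user-facing API.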

flying-sheep · 2024-10-22T11:20:15Z

tests/test_read_lazy.py

+    return request.param
+
+
+@pytest.fixture(params=[True, False], scope="session")


why did you mark this as solved?

ilan-gold · 2024-10-31T13:43:14Z

@flying-sheep Here is the notebook PR: scverse/anndata-tutorials#20. It seems to run OK for me, though. Could you post your error?

ilan-gold mentioned this pull request Nov 30, 2023

Dask and Zarr not loading obsp and obsm from remote s3 #951

Open

ilan-gold mentioned this pull request Jan 31, 2024

lazy dataframes in .obs and .var with backed="r" mode #981

Open

ilan-gold added this to the 0.11.0 milestone Jul 2, 2024

ilan-gold self-assigned this Jul 2, 2024

ilan-gold added the skip-gpu-ci label Jul 5, 2024

ilan-gold force-pushed the ig/xarray_compat branch from 68fcd2b to 6165f07 Compare July 5, 2024 14:16

ilan-gold changed the base branch from main to ig/read_dask_elem July 9, 2024 15:44

ilan-gold mentioned this pull request Jul 10, 2024

(feat): read_elem_as_dask method #1469

Merged

3 tasks

flying-sheep and others added 21 commits July 11, 2024 10:07

Fix Read/Write

7c2e4da

Fix one more

1ba5b99

unify names

49c0d49

claift ReadCallback signature

3666735

Fix type aliases

3a332ad

(fix): clean up typing to use RWAble

d0f4d13

Merge branch 'main' into ig/protocol_for_callback

6e89e14

(fix): use Union

ea29cfa

(fix): add qualname override

f4ff236

(fix): ignore dask and masked array

f50b286

(fix): ignore erroneous class warning

712e085

(fix): upgrade scanpydoc

24dd18b

(fix): use MutableMapping instead of dict due to broken docstring

79d3fdc

Merge branch 'ig/protocol_for_callback' into ig/read_dask_elem

9a2be00

Add data docs

d3bcddf

Merge branch 'ig/read_dask_elem' into ig/xarray_compat

3bae623

Revert "(fix): use MutableMapping instead of dict due to broken docstring"

84fdc96

This reverts commit 79d3fdc.

(fix): add clarification

2608bc3

Simplify

e551e18

Merge branch 'ig/protocol_for_callback' into ig/read_dask_elem

13e3bb1

Merge branch 'main' into ig/read_dask_elem

2935e45

ilan-gold mentioned this pull request Oct 17, 2024

(fix): correct default fill values for dask-sparse #1719

Merged

3 tasks

flying-sheep and others added 8 commits October 17, 2024 15:15

Merge branch 'main' into ig/xarray_compat

aca24db

(chore): remove redefinition

2c082bf

(refactor): reuse join type

81c5fb9

(fix): mixed type dataframe merging

bb49dd2

(fix): condition for going to memory in mixed typing

942661f

(refactor): mixed type helper function

99219c6

(fix): try linking to dask/awkward in docs build

98197fe

(fix): awkward array docs

752e02b

ilan-gold requested a review from flying-sheep October 21, 2024 10:20

flying-sheep requested changes Oct 22, 2024

ilan-gold and others added 16 commits October 25, 2024 11:19

(chore): ValueError -> AssertionError

2a38900

(fix): clean up _lazy_arrays.py

cc40369

(fix): ValueError->KeyError for store

a807673

(chore): add note about unify_extension_dtypes

852ab20

(chore): add ids

310191c

Apply suggestions from code review

1c15b70

Co-authored-by: Philipp A. <flying-sheep@web.de>

(fix): move all changes form anndata_elem

b96bd55

Merge branch 'main' into ig/xarray_compat

a663c5d

(fix): read_elem_as_dask->read_elem_lazy

d1fce7e

(chore): refactor test_read_lazy fixtures

52b6a01

Update tests/test_read_lazy.py

a796d9b

Co-authored-by: Philipp A. <flying-sheep@web.de>

Merge branch 'ig/xarray_compat' of github.com:scverse/anndata into ig/xarray_compat

be786d0

(chore): restore types

e48377a

(fix): do randint

90f6d77

(chore): ermove slots check

8e29713

(fix): return read_lazy

f13bfb4

ilan-gold mentioned this pull request Oct 31, 2024

(feat): read_lazy.ipynb notebook scverse/anndata-tutorials#20

Open

(fix): concating with hdf5 and cluster obviates need for locks works with normal pytest

1643da6
