-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(feat): read_lazy
for whole AnnData
lazy-loading + xarray
reading + read_elem_as_dask
-> read_elem_lazy
#1247
base: main
Are you sure you want to change the base?
Changes from 250 commits
7c2e4da
1ba5b99
49c0d49
3666735
3a332ad
d0f4d13
6e89e14
ea29cfa
f4ff236
f50b286
712e085
24dd18b
79d3fdc
9a2be00
d3bcddf
3bae623
84fdc96
2608bc3
e551e18
13e3bb1
2935e45
bf0be15
890b02a
9d37fc8
1ffe43e
4ab1409
f9df5bc
a85da39
3bef77c
e78334a
eb84176
899184f
efb70ec
118f43c
f742a0a
ae68731
f5e7760
d3f3624
96b13a3
68e507a
9c68e36
410aeda
31a30c4
80fe8cb
b314248
02d4735
6e5534a
d26cfe8
94e43a3
5160016
43da9a3
9735ced
f1730c3
9861b56
235096a
dce9f07
2725ef2
ffe89f0
6cb231e
029add9
75a64fc
3f734fe
c4c2356
cc67a9b
fcb1763
adcd48a
2a72ec0
4c659a1
d3a811a
e852a74
8c92a41
47be954
55f706f
6fa97f0
9e2e21d
eb1237c
6724c62
53796a0
076b92f
036ff3f
9415a14
b5dfaac
d45a2ce
cff41c4
3ccbfaf
ff4d487
ed8fedf
e0e8891
3325f38
528026f
aa0d161
41e3038
dc5c6e6
4edd279
969c6af
2a31ab8
2521ff8
628f9fc
ff9412a
36d57be
51610b1
ba8d147
2cf1262
b1feb6f
02741f5
24e8970
ab3e718
4412710
d3401b2
ef0bbf3
3b6d194
4587fec
97eace5
58654a8
67af64f
bfc2e73
b801d35
a1d0b89
160b522
5f80b61
8ce6409
f9ef9f0
5e69a50
1540d27
371fc2b
5cb2d8d
b86ee6b
227a3c6
3debf9b
4b48988
28c95a7
cec633a
97aff04
bf710d0
1dfebde
d355ed0
cfae08a
1c46ec6
f491724
411bd91
7f89eb3
d77ba37
096f2c6
3c7c627
74b1940
04206b8
a7cbbd8
36bc262
af4520a
fa7358f
4db8c1b
eefbee6
39f2838
dcca711
d3121b7
b25e8ba
870a4f2
c2681e0
e7a915b
f93499d
63a4515
b1c8c22
d082a35
e81d155
c8c5271
c4d0146
b34ac0a
1f4ab92
d25f559
c661d39
bff63cb
784ea9b
9b2d9a3
489cc8d
24f11be
57bcfd9
8b07f43
408d62a
cb125bf
e14f53f
3c5641c
dbe09ca
67fc546
41a9335
58e595b
8875374
4c991d4
50cdc66
eb881a9
fe1f0a6
c0c0c6c
fc72011
562817d
9ef7cf5
69b6cc1
3dfefc4
1eb440e
fe77a5c
a654421
015bdca
7201bad
5d813c0
b50e8ad
3ca669c
7ea20df
91fdb90
a71dad8
0ac61c6
af5c2fe
edac279
cdd9b89
e07426a
8278e0f
30d1bb1
509af7f
58122a1
e6fea74
bd509a1
62cda13
4e1a1f6
a242dea
07caf93
8fd1fa0
94cf8ea
f01818a
aca24db
2c082bf
81c5fb9
bb49dd2
942661f
99219c6
98197fe
752e02b
2a38900
cc40369
a807673
852ab20
310191c
1c15b70
b96bd55
a663c5d
d1fce7e
52b6a01
a796d9b
be786d0
e48377a
90f6d77
8e29713
f13bfb4
1643da6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,6 +7,7 @@ __pycache__/ | |
/*cache/ | ||
/node_modules/ | ||
/data/ | ||
/venv/ | ||
|
||
# Distribution / packaging | ||
/dist/ | ||
|
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -108,6 +108,7 @@ | |||
("py:class", "numpy.ma.core.MaskedArray"), | ||||
("py:class", "dask.array.core.Array"), | ||||
("py:class", "awkward.highlevel.Array"), | ||||
("py:class", "awkward.Array"), | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just add
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yes, we can't actually. sphinx-doc/sphinx#10591 prevents us from using autodoc mocking with singledispatch in our codebase. so we could add |
||||
("py:class", "anndata._core.sparse_dataset.BaseCompressedSparseDataset"), | ||||
("py:obj", "numpy._typing._array_like._ScalarType_co"), | ||||
# https://github.com/sphinx-doc/sphinx/issues/10974 | ||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Add {func}`~anndata.experimental.read_elem_lazy` (in place of `read_elem_as_dask`) to handle backed dataframes, sparse arrays, and dense arrays, as well as a {func}`~anndata.experimental.read_lazy` to handle reading in as much of the on-disk data as possible to produce a {class}`~anndata.AnnData` object {user}`ilan-gold` |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -84,6 +84,7 @@ doc = [ | |
"anndata[dev-doc]", | ||
] | ||
dev-doc = ["towncrier>=24.8.0"] # release notes tool | ||
test-full = ["anndata[test,lazy]"] | ||
test = [ | ||
"loompy>=3.0.5", | ||
"pytest>=8.2", | ||
|
@@ -109,6 +110,7 @@ cu12 = ["cupy-cuda12x"] | |
cu11 = ["cupy-cuda11x"] | ||
# https://github.com/dask/dask/issues/11290 | ||
dask = ["dask[array]>=2022.09.2,<2024.8.0"] | ||
lazy = ["xarray>=2024.06.0", "aiohttp", "requests", "zarr<3.0.0a0", "anndata[dask]"] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure what to do about optional deps here....I think
ilan-gold marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
[tool.hatch.version] | ||
source = "vcs" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -50,10 +50,13 @@ def _normalize_index( | |
| pd.Index, | ||
index: pd.Index, | ||
) -> slice | int | np.ndarray: # ndarray of int or bool | ||
if not isinstance(index, pd.RangeIndex): | ||
msg = "Don’t call _normalize_index with non-categorical/string names" | ||
assert index.dtype != float, msg | ||
assert index.dtype != int, msg | ||
from ..experimental.backed._compat import DataArray | ||
|
||
# TODO: why is this here? All tests pass without it and it seems at the minimum not strict enough. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note this comment. This line was causing problems with the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ruff has a check to disallow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But it would be good to understand what we even going on here @ivirshup ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean if the tests pass, we're good? And if we're not good, we should add a test? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hmm, so it’s not possible that If that’s impossible, feel free to remove. Otherwise we should probably update this check to a TypeError or so. |
||
# if not isinstance(index, pd.RangeIndex): | ||
# msg = "Don’t call _normalize_index with non-categorical/string names and non-range index" | ||
# assert index.dtype != float, msg | ||
# assert index.dtype != int, msg | ||
|
||
# the following is insanely slow for sequences, | ||
# we replaced it using pandas below | ||
|
@@ -107,6 +110,10 @@ def name_idx(i): | |
"are not valid obs/ var names or indices." | ||
) | ||
return positions # np.ndarray[int] | ||
elif isinstance(indexer, DataArray): | ||
if isinstance(indexer.data, DaskArray): | ||
return indexer.data.compute() | ||
return indexer.data | ||
else: | ||
raise IndexError(f"Unknown indexer {indexer!r} of type {type(indexer)}") | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other lines in this listing are type aliases, protocols, and so on. Are these here classes?
It’s OK to have whatever in experimental, but we need to be very careful about how big our API surface gets when we move these things out of experimental.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, that's an interesting point. I thought that by exporting
read_lazy
/making it general, I might appease @ivirshup pre-emptively by not creating only a monolith function and making something more composable a laread_elem
. But in fact, by doing that, we create more API surface, yes.Dataset2D
is something that will be visible on theAnnData
object viaadata.obs
but the others would not be if we didn't makeread_lazy
so general (i.e., because we added functionality to it from its previous state asread_elem_as_dask
). Tough question.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said in #1616, I think it’s fine to not export the full API surface of the classes we have and just commit to a carefully designed protocol instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@flying-sheep I have been thinking about this a bit. Would it be acceptable to simply say "we subclass these things, but make no guarantees about the stability of our added APIs - please see the super class' documentation for supported methods." ?
I think this is perfectly fair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do this, we don’t need to write anything, people will only see the superclass, i.e. the stable API:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to our sphinx-autodoc-singledispatch issue, I don't think we can do this because we don't install the package unless this can work without the package installed or mocked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, right the intersphinx mapping is independent of installed packages!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok @flying-sheep I am not sure
qualname_overrides
works this way - I tried adding it in but could not get the links to redirect to the xarray docsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think they only work for param docs. I should get around to check if I can extend it to work for pure Sphinx as well.