Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): read_lazy for whole AnnData lazy-loading + xarray reading + read_elem_as_dask -> read_elem_lazy #1247

Open
wants to merge 407 commits into
base: main
Choose a base branch
from

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Nov 30, 2023

This PR is a lighter weight version of #947 that involves using the original AnnData object as the class to hold obs and var xr.Dataset.

Copy link

codecov bot commented Dec 7, 2023

Codecov Report

Attention: Patch coverage is 95.32294% with 21 lines in your changes missing coverage. Please review.

Project coverage is 85.20%. Comparing base (0024b82) to head (1643da6).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/anndata/_core/storage.py 37.50% 5 Missing ⚠️
src/anndata/experimental/backed/_lazy_arrays.py 95.45% 4 Missing ⚠️
src/anndata/experimental/backed/_compat.py 84.21% 3 Missing ⚠️
src/anndata/experimental/backed/_xarray.py 95.52% 3 Missing ⚠️
src/anndata/tests/helpers.py 85.00% 3 Missing ⚠️
src/anndata/_io/specs/registry.py 88.23% 2 Missing ⚠️
src/anndata/_io/specs/lazy_methods.py 98.59% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1247      +/-   ##
==========================================
+ Coverage   84.51%   85.20%   +0.69%     
==========================================
  Files          40       45       +5     
  Lines        6050     6457     +407     
==========================================
+ Hits         5113     5502     +389     
- Misses        937      955      +18     
Files with missing lines Coverage Δ
src/anndata/_core/aligned_df.py 97.87% <100.00%> (+0.09%) ⬆️
src/anndata/_core/anndata.py 83.77% <100.00%> (+0.04%) ⬆️
src/anndata/_core/index.py 95.00% <100.00%> (+0.03%) ⬆️
src/anndata/_core/merge.py 86.00% <100.00%> (+2.09%) ⬆️
src/anndata/_core/views.py 85.71% <100.00%> (+0.38%) ⬆️
src/anndata/_io/specs/__init__.py 100.00% <ø> (ø)
src/anndata/_io/zarr.py 83.75% <100.00%> (+0.20%) ⬆️
src/anndata/_types.py 86.11% <100.00%> (+0.81%) ⬆️
src/anndata/experimental/__init__.py 100.00% <100.00%> (ø)
src/anndata/experimental/backed/__init__.py 100.00% <100.00%> (ø)
... and 9 more

... and 1 file with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.11.0 milestone Jul 2, 2024
@ilan-gold ilan-gold self-assigned this Jul 2, 2024
@ilan-gold ilan-gold changed the base branch from main to ig/read_dask_elem July 9, 2024 15:44
@ilan-gold ilan-gold mentioned this pull request Jul 10, 2024
3 tasks
Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make a PR into the notebook repo, there are some things broken in the notebook that I’d like to properly review.

Also I’m still not a fan of the exports from private namespaces, but I’m OK with it if there are at least minimal docstrings for them.

This is very close! Thanks for taking on this huge thing!

Comment on lines +152 to +154
experimental.backed._lazy_arrays.MaskedArray
experimental.backed._lazy_arrays.CategoricalArray
experimental.backed._xarray.Dataset2D
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think they only work for param docs. I should get around to check if I can extend it to work for pure Sphinx as well.

src/anndata/experimental/backed/_xarray.py Show resolved Hide resolved
assert index.dtype != int, msg
from ..experimental.backed._compat import xr

# TODO: why is this here? All tests pass without it and it seems at the minimum not strict enough.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asserts in runtime code are purely there to make debugging easier in case the asserts fail.

Hmm, so it’s not possible that _normalize_index ever gets called with the wrong type of index as result of user action?

If that’s impossible, feel free to remove. Otherwise we should probably update this check to a TypeError or so.

dtype = "object"
else:
dtype = col.dtype.numpy_dtype
# TODO: get good chunk size?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small reminder in case you have an idea for this TODO

src/anndata/_io/specs/lazy_methods.py Outdated Show resolved Hide resolved
tests/test_read_lazy.py Outdated Show resolved Hide resolved
tests/test_read_lazy.py Outdated Show resolved Hide resolved


# remote has object dtype, need to convert back for integers booleans etc.
def correct_extension_dtype_differences(remote: pd.DataFrame, memory: pd.DataFrame):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does remote have object dtypes? can that be fixed ahead of time? will that affect users or is that just here in the tests?

also for clarity, please call it fix_extension_dtype_differences or unify_extension_dtypes and make the comment a docstring. I was unsure if correct was meant as a verb or something like make_correct_...

Copy link
Contributor Author

@ilan-gold ilan-gold Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the name and add a better comment since it's only used for concat. There is no way of concat-ing these lazy pandas adapter arrays we have so in order to work with them, we present a dask array wrapping each when they are concatenated. Practically, this actually doesn't change anything in terms of IO. But it means that when we make the round trip, we lose the original data type. We could maybe store this in uns or something (what the original data type was) but

  1. People concating remote/large datasets probably won't be reading them into memory very often
  2. This gets complicated quickly with mixed data types. So if you have two similarly named columns but different numerical data types, we need to start dealing with upcasting/downcasting.

Feels like a cross-that-bridge-when-we-come-to-it issue

tests/test_read_lazy.py Outdated Show resolved Hide resolved
return request.param


@pytest.fixture(params=[True, False], scope="session")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you mark this as solved?

@ilan-gold
Copy link
Contributor Author

@flying-sheep scverse/anndata-tutorials#20 notebook PR but it seems to run ok for me? Could you post your error?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants