Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate Zarr array read #8472

Merged
merged 5 commits into from
Dec 1, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions xarray/backends/zarr.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,13 +63,14 @@ def encode_zarr_attr_value(value):
class ZarrArrayWrapper(BackendArray):
__slots__ = ("datastore", "dtype", "shape", "variable_name", "_array")

def __init__(self, variable_name, datastore):
def __init__(self, variable_name, zarr_array, datastore):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, variable_name, zarr_array, datastore):
def __init__(self, zarr_array):

Fairly sure this should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test this properly. I've not had good results doing this trick on a different datastore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the heads up! what issues did you see?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure how the zarr backend is implemented, but in my case after xr.open_dataset() the datastore is closed and not needed anymore( until .compute()).

But in this case you're sending a reference from that store to the backendarray which will be lost once the store is closed.

I've gone the opposite way, in get_variables only the required keys are sent. Then only once the backendarray __getitem__ is triggered the arrays are loaded.

I noticed this quite quickly once I played around with the dataset and loaded some variables and if you haven't noticed it yet maybe the zarr backend does some interesting tricks to avoid the closed references.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yerah not sure. We do have distributed tests for reading zarr so presumably it's all OK

self.datastore = datastore
self.variable_name = variable_name

# some callers attempt to evaluate an array if an `array` property exists on the object.
# we prefix with _ to avoid this inference.
self._array = self.datastore.zarr_group[self.variable_name]

self._array = zarr_array
dcherian marked this conversation as resolved.
Show resolved Hide resolved
self.shape = self._array.shape

# preserve vlen string object dtype (GH 7328)
Expand Down Expand Up @@ -501,7 +502,7 @@ def ds(self):
return self.zarr_group

def open_store_variable(self, name, zarr_array):
data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, self))
data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, zarr_array, self))
try_nczarr = self._mode == "r"
dimensions, attributes = _get_zarr_dims_and_attrs(
zarr_array, DIMENSION_KEY, try_nczarr
Expand Down
Loading