Description
What is your issue?
We have identified an inefficiency in the way the ZarrArrayWrapper works. This class currently stores a reference to a ZarrStore and a variable name:
xarray/xarray/backends/zarr.py, lines 63 to 68 in 75af56c
When the array is accessed, its parent group is read and used to open a new Zarr array:
xarray/xarray/backends/zarr.py, lines 83 to 84 in 75af56c
This is a relatively metadata-intensive operation for Zarr: it requires reading both the group metadata and the array metadata. Because of how this wrapper works, these reads currently happen every time data is read from the array. If a dask array wraps the Zarr array with thousands of chunks, the metadata reads are repeated inside every single task. For high-latency stores, that overhead is paid on every chunk read and adds up quickly.
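To make the cost concrete, here is a minimal toy model of the access pattern described above. It does not use xarray or zarr; `CountingStore`, `ToyGroup`, `ToyArray`, and `ReopeningWrapper` are hypothetical stand-ins, and the only assumption is the one stated in this issue: each access reads the group metadata and then the array metadata.

```python
# Toy model only: these classes are hypothetical stand-ins, not xarray or zarr code.


class CountingStore:
    """Pretends to be a high-latency store and counts metadata reads."""

    def __init__(self):
        self.metadata_reads = 0

    def read_metadata(self, key):
        self.metadata_reads += 1
        return {"key": key}


class ToyArray:
    """Stand-in for an opened array; indexing returns a label for the chunk."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, chunk_index):
        return f"{self.name}[{chunk_index}]"


class ToyGroup:
    """Stand-in for an opened group."""

    def __init__(self, store):
        self.store = store
        store.read_metadata("group")  # opening the group reads group metadata

    def __getitem__(self, name):
        self.store.read_metadata(name)  # opening an array reads array metadata
        return ToyArray(name)


class ReopeningWrapper:
    """Stores a store and a variable name, and re-opens the array on every read."""

    def __init__(self, variable_name, store):
        self.variable_name = variable_name
        self.store = store

    def get_array(self):
        return ToyGroup(self.store)[self.variable_name]

    def __getitem__(self, chunk_index):
        return self.get_array()[chunk_index]


store = CountingStore()
wrapper = ReopeningWrapper("temperature", store)
n_chunks = 1000
# One read per chunk, roughly what one dask task per chunk would do.
chunks = [wrapper[i] for i in range(n_chunks)]
print(store.metadata_reads)  # two metadata reads per chunk
```

In this model the number of metadata reads grows linearly with the number of chunks, which is the behaviour this issue is about.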
Instead, we should just reference the zarr.Array object directly within the ZarrArrayWrapper. It is lightweight and easily serializable, so there is no need to re-open the array each time we want to read data from it. This change should bring an immediate performance improvement to all Zarr operations.
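A rough sketch of the proposed shape, again as a toy model rather than the actual xarray change: the wrapper is handed an already opened array and reuses it for every read. All class names here are hypothetical, and `DirectWrapper` only illustrates the idea of holding the array object instead of a store plus a variable name.

```python
# Toy model only: these classes are hypothetical stand-ins, not xarray or zarr code.


class CountingStore:
    """Pretends to be a high-latency store and counts metadata reads."""

    def __init__(self):
        self.metadata_reads = 0

    def read_metadata(self, key):
        self.metadata_reads += 1
        return {"key": key}


class ToyArray:
    """Stand-in for an opened array; indexing returns a label for the chunk."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, chunk_index):
        return f"{self.name}[{chunk_index}]"


class ToyGroup:
    """Stand-in for an opened group."""

    def __init__(self, store):
        self.store = store
        store.read_metadata("group")  # opening the group reads group metadata

    def __getitem__(self, name):
        self.store.read_metadata(name)  # opening an array reads array metadata
        return ToyArray(name)


class DirectWrapper:
    """Holds the opened array itself, so reads never re-open anything."""

    def __init__(self, array):
        self._array = array

    def get_array(self):
        return self._array

    def __getitem__(self, chunk_index):
        return self._array[chunk_index]


store = CountingStore()
array = ToyGroup(store)["temperature"]  # metadata is read once, up front
wrapper = DirectWrapper(array)
n_chunks = 1000
chunks = [wrapper[i] for i in range(n_chunks)]
print(store.metadata_reads)  # stays at the cost of the single initial open
```

Here the metadata cost is paid once when the array is opened and stays constant no matter how many chunks are read afterwards.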