Potential performance optimization for Zarr backend #8290

Closed
@rabernat

What is your issue?

We have identified an inefficiency in the way the ZarrArrayWrapper works. This class currently stores a reference to a ZarrStore and a variable name:

class ZarrArrayWrapper(BackendArray):
    __slots__ = ("datastore", "dtype", "shape", "variable_name")

    def __init__(self, variable_name, datastore):
        self.datastore = datastore
        self.variable_name = variable_name

When the array is accessed, its parent group is read and used to open a new Zarr array:

def get_array(self):
    return self.datastore.zarr_group[self.variable_name]

This is a relatively metadata-intensive operation for Zarr. It requires reading both the group metadata and the array metadata. Because of how this wrapper works, these reads currently happen every time data is read from the array. If we have a dask array wrapping the Zarr array with thousands of chunks, the metadata reads will happen within every single task. For high-latency stores, this is really bad.
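To get a rough feel for the cost, here is a back-of-the-envelope sketch. The function name, the task count, and the 50 ms round-trip latency are illustrative assumptions, not measurements. The two reads per task correspond to the group metadata and array metadata reads described above.

```python
def metadata_overhead_seconds(n_tasks, reads_per_task=2, latency_s=0.05):
    """Cumulative time spent on metadata requests, summed over all tasks.

    reads_per_task=2 models one group-metadata read plus one
    array-metadata read per task. latency_s is an assumed round-trip
    time to the store, not a measured value.
    """
    return n_tasks * reads_per_task * latency_s


# Hypothetical workload: 10,000 dask tasks against a store with 50 ms latency.
total = metadata_overhead_seconds(n_tasks=10_000)
print(f"{total:.0f} s of cumulative metadata latency")
```

This is a sum over tasks, so parallel workers would spread it out in wall-clock time, but none of these requests fetch any actual chunk data.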

Instead, we should just reference the zarr.Array object directly within the ZarrArrayWrapper. It's lightweight and easily serializable. There is no need to re-open the array each time we want to read data from it. This change will lead to an immediate performance enhancement in all Zarr operations.
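A minimal sketch of the difference, using pure-Python stand-ins rather than real Zarr objects. `FakeArray`, `FakeGroup`, `ReopeningWrapper`, and `DirectWrapper` are all hypothetical names invented for this illustration; this is not the actual xarray implementation. Each group lookup stands in for the metadata reads that opening an array would trigger.

```python
class FakeArray:
    """Hypothetical stand-in for zarr.Array: just wraps some data."""

    def __init__(self, data):
        self.data = data

    def __getitem__(self, key):
        return self.data[key]


class FakeGroup:
    """Hypothetical stand-in for a Zarr group that counts array lookups."""

    def __init__(self, arrays):
        self._arrays = arrays
        self.lookups = 0  # each lookup models a metadata round trip

    def __getitem__(self, name):
        self.lookups += 1
        return self._arrays[name]


class ReopeningWrapper:
    """Current pattern: look the array up in the group on every read."""

    def __init__(self, variable_name, group):
        self.group = group
        self.variable_name = variable_name

    def get_array(self):
        return self.group[self.variable_name]

    def __getitem__(self, key):
        return self.get_array()[key]


class DirectWrapper:
    """Proposed pattern: open the array once and keep the reference."""

    def __init__(self, array):
        self._array = array

    def get_array(self):
        return self._array

    def __getitem__(self, key):
        return self.get_array()[key]


group = FakeGroup({"temperature": FakeArray(list(range(100)))})

# Ten reads through the re-opening wrapper: one lookup per read.
old = ReopeningWrapper("temperature", group)
for i in range(10):
    old[i]
lookups_old = group.lookups

# Ten reads through the direct wrapper: one lookup at construction only.
group.lookups = 0
new = DirectWrapper(group["temperature"])
for i in range(10):
    new[i]
lookups_new = group.lookups

print(lookups_old, lookups_new)
```

In this toy setup the number of lookups grows with the number of reads for the re-opening wrapper and stays constant for the direct one, which is the behavior the proposal is after.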
