
Slow open_datatree for zarr stores with many coordinate variables #9640

Open
@TomNicholas


Originally posted by @aladinor in #9511 (comment)

Hi everyone,

I've been working with hierarchical structures to store weather radar data. We're using xradar and datatree to manage these datasets efficiently. Currently, we use the standard WMO CfRadial2.1/FM301 format to build a datatree model with xradar, and the data is then stored in Zarr format.

This data model stores historical weather radar datasets in Zarr format while supporting real-time updates as radar networks operate continuously. It leverages a Zarr-append pattern for seamless data integration.

I think our data model works, at least in this beta stage. However, as the dataset grows, we've noticed longer load times when opening the Zarr store with open_datatree. The timings below show how the time to open the dataset grows with its size:

For ~15 GB in size, open_datatree takes around 5.73 seconds.

For ~80 GB in size, open_datatree takes around 11.6 seconds.

The larger datasets I've worked with take even longer to open.

The datatree structure contains 11 nodes, each representing a point where live-updating data is appended. Here is a minimal reproducible example:

import s3fs
import xarray as xr
from time import time


def main():
    print(xr.__version__)
    st = time()
    # S3 bucket connection (anonymous access)
    URL = 'https://js2.jetstream-cloud.org:8001/'
    path = 'pythia/radar/erad2024'
    fs = s3fs.S3FileSystem(anon=True,
                           client_kwargs=dict(endpoint_url=URL))
    file = s3fs.S3Map(f"{path}/zarr_radar/Guaviare_test.zarr", s3=fs)

    # Open the datatree stored in Zarr, using consolidated metadata
    dtree = xr.backends.api.open_datatree(
        file,
        engine='zarr',
        consolidated=True,
        chunks={}
    )
    # Elapsed time includes the S3 setup above as well as open_datatree
    print(f"total time: {time() - st}")


if __name__ == "__main__":
    main()

The output is:

2024.9.1.dev23+g52f13d44
total time: 5.198976516723633
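The total above covers both the S3 connection setup and the open_datatree call. To see how much each phase contributes, something like the following could help. It is only a sketch: the `timed` helper and the `timings` dict are hypothetical names, and the `sleep` calls are stand-ins for the real `s3fs` and `open_datatree` calls from the script above.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        timings[label] = perf_counter() - start


# Stand-in for creating the S3FileSystem and S3Map.
with timed("connect"):
    sleep(0.01)

# Stand-in for the open_datatree(...) call.
with timed("open_datatree"):
    sleep(0.02)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Replacing the two `sleep` calls with the real connection and open steps gives one number per phase instead of a single total.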

For more information about the data model, you can check the raw2zarr GitHub repo and the poster we presented at the SciPy conference.
