Slow open_datatree for zarr stores with many coordinate variables

_Originally posted by @aladinor in https://github.com/pydata/xarray/issues/9511#issuecomment-2405191800_

Hi everyone, 

I've been working with hierarchical structures to store weather radar data. We’re leveraging xradar and datatree to manage these datasets efficiently. Currently, we use the standard WMO CfRadial2.1/FM301 format to build a datatree model with [`xradar`](https://docs.openradarscience.org/projects/xradar/en/stable/), and then store the data in `Zarr` format.

This data model stores historical weather radar datasets in `Zarr` format while supporting real-time updates as radar networks operate continuously. It leverages a Zarr-append pattern for seamless data integration.

I think our data model works, at least in this beta stage. However, as the dataset grows, we’ve noticed longer load times when opening the `Zarr` store with `open_datatree`. As shown below, the time to open the dataset increases with its size:

For a ~15 GB store, `open_datatree` takes around 5.73 seconds:
<img src="https://github.com/user-attachments/assets/8ed62b14-00a7-446e-b38f-767e2ce9087c" width=70%>

For a ~80 GB store, `open_datatree` takes around 11.6 seconds:

<img src="https://github.com/user-attachments/assets/1eb05164-4ac3-491d-8005-09e158b92fdb" width=70%>

I've also worked with larger datasets, which take even longer to open.

The datatree structure contains 11 nodes, each representing a point where live-updating data is appended. Here is a minimal reproducible example, in case you want to take a look.

```python 
import s3fs
import xarray as xr
from time import time


def main():
    print(xr.__version__)
    st = time()
    # S3 bucket connection (anonymous access)
    URL = 'https://js2.jetstream-cloud.org:8001/'
    path = 'pythia/radar/erad2024'
    fs = s3fs.S3FileSystem(anon=True,
                           client_kwargs=dict(endpoint_url=URL))
    file = s3fs.S3Map(f"{path}/zarr_radar/Guaviare_test.zarr", s3=fs)

    # opening datatree stored in zarr
    dtree = xr.backends.api.open_datatree(
        file,
        engine='zarr',
        consolidated=True,
        chunks={}
    )
    print(f"total time: {time() - st}")


if __name__ == "__main__":
    main()
```

The output is:

```text
2024.9.1.dev23+g52f13d44
total time: 5.198976516723633
```
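
Note that the timer in the example above starts before the S3 connection is set up, so the reported total includes more than the `open_datatree` call itself. Below is a minimal sketch of how each stage could be timed separately. The `timed` helper and the stage labels are hypothetical names introduced here for illustration, and the two `sleep` calls are stand-ins for the real S3 setup and `open_datatree` call, so the numbers it prints mean nothing by themselves.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, timings):
    """Record the wall-clock time spent inside the `with` block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        timings[label] = perf_counter() - start


timings = {}

with timed("connect", timings):
    sleep(0.01)  # stand-in for creating the S3FileSystem / S3Map objects

with timed("open", timings):
    sleep(0.02)  # stand-in for the open_datatree call

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Replacing the stand-ins with the actual calls from the example should show how much of the total is spent in `open_datatree` as opposed to connection setup.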
For more information about the data model, you can check the [`raw2zarr`](https://github.com/aladinor/raw2zarr) GitHub repo and the [poster](https://github.com/aladinor/raw2zarr/blob/main/SCIPY_POSTER.pdf) we presented at the SciPy conference.
            
