Adding new datasets in zarr hierarchy (group) is slower and slower, as previous groups metadata/data is scanned over and over #8488

Closed
@mdespriee

What happened?

While creating a zarr store on S3 containing many datasets, organized in various groups and subgroups, I noticed the process getting slower and slower.
By tracking the calls to S3 with a profiler, I found many calls targeting groups in the zarr store that were unrelated to the current insertion.

I extracted the bulk of the logic in the reproducer below:

  • a store created with zarr.storage.FSStore
  • datasets being added using dataset.to_zarr(store, group=...)

I added an fsspec filesystem wrapper class to track calls to the filesystem (on top of local storage).

If you run the reproducer, you'll see that for each insertion of a dataset into a new group, the metadata and/or arrays of previous groups get listed or opened.

When used against S3, this leads to extremely high execution times and volumes of S3 API calls.
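As a rough back-of-the-envelope estimate of how this grows, the sketch below assumes that every insertion rescans all groups present in the store, and that each rescan costs 15 calls per group (5 listings plus 10 opens, as counted for one group in the log extract further down). The function names are hypothetical, and the assumption that earlier insertions behave like the one shown in the log is not verified here.

```python
def rescan_calls(n_groups, calls_per_group=15):
    """Calls spent rescanning groups over n_groups insertions.

    Assumption: insertion number i (1-based) rescans all i groups
    present in the store, so the total is 15 * (1 + 2 + ... + n).
    """
    return calls_per_group * n_groups * (n_groups + 1) // 2


def targeted_calls(n_groups, calls_per_group=15):
    """Same per-group cost, but each insertion only touches its own group."""
    return calls_per_group * n_groups


for n in (10, 100):
    print(n, rescan_calls(n), targeted_calls(n))
```

Under these assumptions the rescan cost grows quadratically with the number of groups, while a write that only touched its own group would grow linearly.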

What did you expect to happen?

List and open operations should only target the group being written and the root of the zarr store.

Minimal Complete Verifiable Example

import xarray as xr
import zarr
import zarr.storage
import numpy as np
import fsspec
from datetime import datetime
import time

from fsspec.spec import AbstractFileSystem
from fsspec.implementations.local import LocalFileSystem


ds = xr.DataArray(
    np.random.rand(1000),
    dims=["x"], 
    coords={
        "x": range(1000),
        "a": 0,
        "b": 1,
    }, name="array").to_dataset()



class InstrumentedFS(AbstractFileSystem):
    """ A wrapper to track calls to FS
    """
    def __init__(
        self,
        fs: LocalFileSystem,
    ):
        super().__init__()
        self._fs = fs

    def to_json(self):
        pass

    def _open(
        self,
        path,
        mode="rb",
        block_size=None,
        **kwargs,
    ):
        print(f"Opening {path}")
        return self._fs._open(path, mode, block_size, **kwargs)

    @property
    def fsid(self):
        return self._fs.fsid

    def ls(self, path, detail=False, **kwargs):
        print(f"Listing {path}")
        return self._fs.ls(path, detail, **kwargs)

    def cp_file(self, path1, path2, **kwargs):
        print(f"Copying {path1} to {path2}")
        self._fs.cp_file(path1, path2, **kwargs)

    def _rm(self, path):
        self._fs._rm(path)

    def created(self, path):
        print(f"called created {path}")
        return self._fs.created(path)

    def modified(self, path):
        print(f"called modified {path}")
        return self._fs.modified(path)

    def sign(self, path, expiration=100, **kwargs):
        return self._fs.sign(path, expiration, **kwargs)

    def mkdir(self, path, create_parents=True, **kwargs):
        print(f"called mkdir {path}")
        return self._fs.mkdir(path, create_parents, **kwargs)

    def makedirs(self, path, exist_ok=False):
        print(f"called makedirs {path}")
        self._fs.makedirs(path, exist_ok)

    def rmdir(self, path):
        self._fs.rmdir(path)

    def info(self, path, **kwargs):
        print(f"called info {path}")
        return self._fs.info(path, **kwargs)



# Single quotes inside the f-string keep this line valid on Python versions before 3.12
path = f"/tmp/test_{datetime.now().strftime('%Y%m%d%H%M')}.zarr"
print(path)
fs = fsspec.open(path).fs
ifs = InstrumentedFS(fs=fs)


store = zarr.storage.FSStore(
    url=path,
    mode="w",
    fs=ifs,
    create=True
)


for i in range(10):
    print("----------------------------------")
    print(f"group {i}")
    ds.to_zarr(
        store = store,
        group = "group"+str(i),
        encoding = {"x": {"chunks": (-1, -1)}},
    )
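The reproducer prints every call, which gets hard to read as the number of groups grows. A possible alternative is to tally the calls instead. The sketch below is a generic, stdlib-only proxy and not part of the reproducer: `CallRecorder` and `FakeFS` are hypothetical names, and since the proxy is not an `AbstractFileSystem` subclass it is not assumed to be accepted by `FSStore` as-is. The same counting idea could be applied inside `InstrumentedFS`.

```python
from collections import Counter


class CallRecorder:
    """Hypothetical proxy: forwards attribute access to `target`
    and counts each method call by (method name, first argument)."""

    def __init__(self, target):
        self._target = target
        self.calls = Counter()

    def __getattr__(self, name):
        # Only reached for attributes not found on the recorder itself
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            # The first positional argument is taken to be the path, if any
            key = (name, args[0] if args else None)
            self.calls[key] += 1
            return attr(*args, **kwargs)

        return wrapper


class FakeFS:
    """Stand-in filesystem, used only to exercise the recorder."""

    def ls(self, path):
        return []

    def info(self, path):
        return {"name": path}


fs = CallRecorder(FakeFS())
fs.info("/store/group0/.zgroup")
fs.ls("/store")
fs.ls("/store")

for (method, path), count in fs.calls.most_common():
    print(f"{count:3d}  {method}  {path}")
```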

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Here is an extract of the log output while adding the 10th dataset (group9).

----------------------------------
group 9
called info /tmp/test_202311281818.zarr/group9/.zarray
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/.zarray
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zarray
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zgroup
called makedirs /tmp/test_202311281818.zarr/group9
Opening /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zarray
Opening /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/x/.zarray
called info /tmp/test_202311281818.zarr/group9/x/.zgroup
called info /tmp/test_202311281818.zarr/group9/a/.zarray
called info /tmp/test_202311281818.zarr/group9/a/.zgroup
called info /tmp/test_202311281818.zarr/group9/b/.zarray
called info /tmp/test_202311281818.zarr/group9/b/.zgroup
called info /tmp/test_202311281818.zarr/group9/array/.zarray
called info /tmp/test_202311281818.zarr/group9/array/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zattrs
called makedirs /tmp/test_202311281818.zarr/group9
Opening /tmp/test_202311281818.zarr/group9/.zattrs
called info /tmp/test_202311281818.zarr/group9/x/.zarray
called info /tmp/test_202311281818.zarr/group9/x/.zgroup
called info /tmp/test_202311281818.zarr/.zarray
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zarray
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/x/.zarray
called info /tmp/test_202311281818.zarr/group9/x/.zgroup
called info /tmp/test_202311281818.zarr/group9/x/.zarray
called makedirs /tmp/test_202311281818.zarr/group9/x
Opening /tmp/test_202311281818.zarr/group9/x/.zarray
Opening /tmp/test_202311281818.zarr/group9/x/.zarray
called info /tmp/test_202311281818.zarr/group9/x/.zattrs
called makedirs /tmp/test_202311281818.zarr/group9/x
Opening /tmp/test_202311281818.zarr/group9/x/.zattrs
Opening /tmp/test_202311281818.zarr/group9/x/0
called info /tmp/test_202311281818.zarr/group9/a/.zarray
called info /tmp/test_202311281818.zarr/group9/a/.zgroup
called info /tmp/test_202311281818.zarr/.zarray
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zarray
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/a/.zarray
called info /tmp/test_202311281818.zarr/group9/a/.zgroup
called info /tmp/test_202311281818.zarr/group9/a/.zarray
called makedirs /tmp/test_202311281818.zarr/group9/a
Opening /tmp/test_202311281818.zarr/group9/a/.zarray
Opening /tmp/test_202311281818.zarr/group9/a/.zarray
called info /tmp/test_202311281818.zarr/group9/a/.zattrs
called makedirs /tmp/test_202311281818.zarr/group9/a
Opening /tmp/test_202311281818.zarr/group9/a/.zattrs
Opening /tmp/test_202311281818.zarr/group9/a/0
called info /tmp/test_202311281818.zarr/group9/a/0
called makedirs /tmp/test_202311281818.zarr/group9/a
Opening /tmp/test_202311281818.zarr/group9/a/0
called info /tmp/test_202311281818.zarr/group9/array/.zarray
called info /tmp/test_202311281818.zarr/group9/array/.zgroup
called info /tmp/test_202311281818.zarr/.zarray
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zarray
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/array/.zarray
called info /tmp/test_202311281818.zarr/group9/array/.zgroup
called info /tmp/test_202311281818.zarr/group9/array/.zarray
called makedirs /tmp/test_202311281818.zarr/group9/array
Opening /tmp/test_202311281818.zarr/group9/array/.zarray
Opening /tmp/test_202311281818.zarr/group9/array/.zarray
called info /tmp/test_202311281818.zarr/group9/array/.zattrs
called makedirs /tmp/test_202311281818.zarr/group9/array
Opening /tmp/test_202311281818.zarr/group9/array/.zattrs
Opening /tmp/test_202311281818.zarr/group9/array/0
called info /tmp/test_202311281818.zarr/group9/b/.zarray
called info /tmp/test_202311281818.zarr/group9/b/.zgroup
called info /tmp/test_202311281818.zarr/.zarray
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zarray
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/.zgroup
called info /tmp/test_202311281818.zarr/group9/b/.zarray
called info /tmp/test_202311281818.zarr/group9/b/.zgroup
called info /tmp/test_202311281818.zarr/group9/b/.zarray
called makedirs /tmp/test_202311281818.zarr/group9/b
Opening /tmp/test_202311281818.zarr/group9/b/.zarray
Opening /tmp/test_202311281818.zarr/group9/b/.zarray
called info /tmp/test_202311281818.zarr/group9/b/.zattrs
called makedirs /tmp/test_202311281818.zarr/group9/b
Opening /tmp/test_202311281818.zarr/group9/b/.zattrs
Opening /tmp/test_202311281818.zarr/group9/b/0
called info /tmp/test_202311281818.zarr/group9/b/0
called makedirs /tmp/test_202311281818.zarr/group9/b
Opening /tmp/test_202311281818.zarr/group9/b/0
Listing /tmp/test_202311281818.zarr
Listing /tmp/test_202311281818.zarr/group6
Listing /tmp/test_202311281818.zarr/group6/array
Listing /tmp/test_202311281818.zarr/group6/x
Listing /tmp/test_202311281818.zarr/group6/a
Listing /tmp/test_202311281818.zarr/group6/b
Listing /tmp/test_202311281818.zarr/group7
Listing /tmp/test_202311281818.zarr/group7/array
Listing /tmp/test_202311281818.zarr/group7/x
Listing /tmp/test_202311281818.zarr/group7/a
Listing /tmp/test_202311281818.zarr/group7/b
Listing /tmp/test_202311281818.zarr/group3
Listing /tmp/test_202311281818.zarr/group3/array
Listing /tmp/test_202311281818.zarr/group3/x
Listing /tmp/test_202311281818.zarr/group3/a
Listing /tmp/test_202311281818.zarr/group3/b
Listing /tmp/test_202311281818.zarr/group8
Listing /tmp/test_202311281818.zarr/group8/array
Listing /tmp/test_202311281818.zarr/group8/x
Listing /tmp/test_202311281818.zarr/group8/a
Listing /tmp/test_202311281818.zarr/group8/b
Listing /tmp/test_202311281818.zarr/group4
Listing /tmp/test_202311281818.zarr/group4/array
Listing /tmp/test_202311281818.zarr/group4/x
Listing /tmp/test_202311281818.zarr/group4/a
Listing /tmp/test_202311281818.zarr/group4/b
Listing /tmp/test_202311281818.zarr/group5
Listing /tmp/test_202311281818.zarr/group5/array
Listing /tmp/test_202311281818.zarr/group5/x
Listing /tmp/test_202311281818.zarr/group5/a
Listing /tmp/test_202311281818.zarr/group5/b
Listing /tmp/test_202311281818.zarr/group1
Listing /tmp/test_202311281818.zarr/group1/array
Listing /tmp/test_202311281818.zarr/group1/x
Listing /tmp/test_202311281818.zarr/group1/a
Listing /tmp/test_202311281818.zarr/group1/b
Listing /tmp/test_202311281818.zarr/group9
Listing /tmp/test_202311281818.zarr/group9/array
Listing /tmp/test_202311281818.zarr/group9/x
Listing /tmp/test_202311281818.zarr/group9/a
Listing /tmp/test_202311281818.zarr/group9/b
Listing /tmp/test_202311281818.zarr/group2
Listing /tmp/test_202311281818.zarr/group2/array
Listing /tmp/test_202311281818.zarr/group2/x
Listing /tmp/test_202311281818.zarr/group2/a
Listing /tmp/test_202311281818.zarr/group2/b
Listing /tmp/test_202311281818.zarr/group0
Listing /tmp/test_202311281818.zarr/group0/array
Listing /tmp/test_202311281818.zarr/group0/x
Listing /tmp/test_202311281818.zarr/group0/a
Listing /tmp/test_202311281818.zarr/group0/b
Opening /tmp/test_202311281818.zarr/.zgroup
Opening /tmp/test_202311281818.zarr/group0/.zattrs
Opening /tmp/test_202311281818.zarr/group0/.zgroup
Opening /tmp/test_202311281818.zarr/group0/a/.zarray
Opening /tmp/test_202311281818.zarr/group0/a/.zattrs
Opening /tmp/test_202311281818.zarr/group0/array/.zarray
Opening /tmp/test_202311281818.zarr/group0/array/.zattrs
Opening /tmp/test_202311281818.zarr/group0/b/.zarray
Opening /tmp/test_202311281818.zarr/group0/b/.zattrs
Opening /tmp/test_202311281818.zarr/group0/x/.zarray
Opening /tmp/test_202311281818.zarr/group0/x/.zattrs
Opening /tmp/test_202311281818.zarr/group1/.zattrs
Opening /tmp/test_202311281818.zarr/group1/.zgroup
Opening /tmp/test_202311281818.zarr/group1/a/.zarray
Opening /tmp/test_202311281818.zarr/group1/a/.zattrs
Opening /tmp/test_202311281818.zarr/group1/array/.zarray
Opening /tmp/test_202311281818.zarr/group1/array/.zattrs
Opening /tmp/test_202311281818.zarr/group1/b/.zarray
Opening /tmp/test_202311281818.zarr/group1/b/.zattrs
Opening /tmp/test_202311281818.zarr/group1/x/.zarray
Opening /tmp/test_202311281818.zarr/group1/x/.zattrs
Opening /tmp/test_202311281818.zarr/group2/.zattrs
Opening /tmp/test_202311281818.zarr/group2/.zgroup
Opening /tmp/test_202311281818.zarr/group2/a/.zarray
Opening /tmp/test_202311281818.zarr/group2/a/.zattrs
Opening /tmp/test_202311281818.zarr/group2/array/.zarray
Opening /tmp/test_202311281818.zarr/group2/array/.zattrs
Opening /tmp/test_202311281818.zarr/group2/b/.zarray
Opening /tmp/test_202311281818.zarr/group2/b/.zattrs
Opening /tmp/test_202311281818.zarr/group2/x/.zarray
Opening /tmp/test_202311281818.zarr/group2/x/.zattrs
Opening /tmp/test_202311281818.zarr/group3/.zattrs
Opening /tmp/test_202311281818.zarr/group3/.zgroup
Opening /tmp/test_202311281818.zarr/group3/a/.zarray
Opening /tmp/test_202311281818.zarr/group3/a/.zattrs
Opening /tmp/test_202311281818.zarr/group3/array/.zarray
Opening /tmp/test_202311281818.zarr/group3/array/.zattrs
Opening /tmp/test_202311281818.zarr/group3/b/.zarray
Opening /tmp/test_202311281818.zarr/group3/b/.zattrs
Opening /tmp/test_202311281818.zarr/group3/x/.zarray
Opening /tmp/test_202311281818.zarr/group3/x/.zattrs
Opening /tmp/test_202311281818.zarr/group4/.zattrs
Opening /tmp/test_202311281818.zarr/group4/.zgroup
Opening /tmp/test_202311281818.zarr/group4/a/.zarray
Opening /tmp/test_202311281818.zarr/group4/a/.zattrs
Opening /tmp/test_202311281818.zarr/group4/array/.zarray
Opening /tmp/test_202311281818.zarr/group4/array/.zattrs
Opening /tmp/test_202311281818.zarr/group4/b/.zarray
Opening /tmp/test_202311281818.zarr/group4/b/.zattrs
Opening /tmp/test_202311281818.zarr/group4/x/.zarray
Opening /tmp/test_202311281818.zarr/group4/x/.zattrs
Opening /tmp/test_202311281818.zarr/group5/.zattrs
Opening /tmp/test_202311281818.zarr/group5/.zgroup
Opening /tmp/test_202311281818.zarr/group5/a/.zarray
Opening /tmp/test_202311281818.zarr/group5/a/.zattrs
Opening /tmp/test_202311281818.zarr/group5/array/.zarray
Opening /tmp/test_202311281818.zarr/group5/array/.zattrs
Opening /tmp/test_202311281818.zarr/group5/b/.zarray
Opening /tmp/test_202311281818.zarr/group5/b/.zattrs
Opening /tmp/test_202311281818.zarr/group5/x/.zarray
Opening /tmp/test_202311281818.zarr/group5/x/.zattrs
Opening /tmp/test_202311281818.zarr/group6/.zattrs
Opening /tmp/test_202311281818.zarr/group6/.zgroup
Opening /tmp/test_202311281818.zarr/group6/a/.zarray
Opening /tmp/test_202311281818.zarr/group6/a/.zattrs
Opening /tmp/test_202311281818.zarr/group6/array/.zarray
Opening /tmp/test_202311281818.zarr/group6/array/.zattrs
Opening /tmp/test_202311281818.zarr/group6/b/.zarray
Opening /tmp/test_202311281818.zarr/group6/b/.zattrs
Opening /tmp/test_202311281818.zarr/group6/x/.zarray
Opening /tmp/test_202311281818.zarr/group6/x/.zattrs
Opening /tmp/test_202311281818.zarr/group7/.zattrs
Opening /tmp/test_202311281818.zarr/group7/.zgroup
Opening /tmp/test_202311281818.zarr/group7/a/.zarray
Opening /tmp/test_202311281818.zarr/group7/a/.zattrs
Opening /tmp/test_202311281818.zarr/group7/array/.zarray
Opening /tmp/test_202311281818.zarr/group7/array/.zattrs
Opening /tmp/test_202311281818.zarr/group7/b/.zarray
Opening /tmp/test_202311281818.zarr/group7/b/.zattrs
Opening /tmp/test_202311281818.zarr/group7/x/.zarray
Opening /tmp/test_202311281818.zarr/group7/x/.zattrs
Opening /tmp/test_202311281818.zarr/group8/.zattrs
Opening /tmp/test_202311281818.zarr/group8/.zgroup
Opening /tmp/test_202311281818.zarr/group8/a/.zarray
Opening /tmp/test_202311281818.zarr/group8/a/.zattrs
Opening /tmp/test_202311281818.zarr/group8/array/.zarray
Opening /tmp/test_202311281818.zarr/group8/array/.zattrs
Opening /tmp/test_202311281818.zarr/group8/b/.zarray
Opening /tmp/test_202311281818.zarr/group8/b/.zattrs
Opening /tmp/test_202311281818.zarr/group8/x/.zarray
Opening /tmp/test_202311281818.zarr/group8/x/.zattrs
Opening /tmp/test_202311281818.zarr/group9/.zattrs
Opening /tmp/test_202311281818.zarr/group9/.zgroup
Opening /tmp/test_202311281818.zarr/group9/a/.zarray
Opening /tmp/test_202311281818.zarr/group9/a/.zattrs
Opening /tmp/test_202311281818.zarr/group9/array/.zarray
Opening /tmp/test_202311281818.zarr/group9/array/.zattrs
Opening /tmp/test_202311281818.zarr/group9/b/.zarray
Opening /tmp/test_202311281818.zarr/group9/b/.zattrs
Opening /tmp/test_202311281818.zarr/group9/x/.zarray
Opening /tmp/test_202311281818.zarr/group9/x/.zattrs
called info /tmp/test_202311281818.zarr/.zmetadata
called makedirs /tmp/test_202311281818.zarr
Opening /tmp/test_202311281818.zarr/.zmetadata
Opening /tmp/test_202311281818.zarr/.zmetadata
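To quantify how many of these calls target groups other than the one being written, the log can be tallied per group. This is a minimal sketch, assuming the log format produced by the reproducer ("called <op> <path>", "Listing <path>", "Opening <path>") and group names of the form `groupN`. `tally_by_group` is a hypothetical helper, and `LOG_SAMPLE` holds only a few lines copied from the extract above.

```python
import re
from collections import Counter

# A few lines copied from the log extract above
LOG_SAMPLE = """\
group 9
called info /tmp/test_202311281818.zarr/group9/.zgroup
Listing /tmp/test_202311281818.zarr
Listing /tmp/test_202311281818.zarr/group6
Listing /tmp/test_202311281818.zarr/group6/array
Opening /tmp/test_202311281818.zarr/group0/.zattrs
Opening /tmp/test_202311281818.zarr/group9/x/.zarray
Opening /tmp/test_202311281818.zarr/.zmetadata
"""

# Captures the first path component after ".zarr" when it is a group
GROUP_RE = re.compile(r"\.zarr/(group\d+)(?:/|$)")


def tally_by_group(log_text):
    """Count logged filesystem calls per top-level group."""
    counts = Counter()
    for line in log_text.splitlines():
        if ".zarr" not in line:
            continue  # skip separators such as "group 9"
        match = GROUP_RE.search(line)
        counts[match.group(1) if match else "(root)"] += 1
    return counts


counts = tally_by_group(LOG_SAMPLE)
current = "group9"
other = sum(n for g, n in counts.items() if g not in (current, "(root)"))
print(counts)
print(f"calls to groups other than {current}: {other}")
```

Running the same function over the full output of the reproducer would give the per-group totals for each insertion.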

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:43:22) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-10-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.11.0
pandas: 2.1.3
numpy: 1.26.2
scipy: 1.11.4
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.1
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.18.1
sphinx: None
