This repository was archived by the owner on Oct 24, 2024. It is now read-only.
to_zarr() is extremely slow writing to high latency store #277
Closed
Labels: IO (Representation of particular file formats as trees)
Description
Unbearably so, I would say. Here is an example with a tree containing 13 nodes and negligible data, trying to write to S3/GCS with fsspec:
import numpy as np
import xarray as xr
from datatree import DataTree

ds = xr.Dataset(
    data_vars={
        "a": xr.DataArray(np.ones((2, 2)), coords={"x": [1, 2], "y": [1, 2]}),
        "b": xr.DataArray(np.ones((2, 2)), coords={"x": [1, 2], "y": [1, 2]}),
        "c": xr.DataArray(np.ones((2, 2)), coords={"x": [1, 2], "y": [1, 2]}),
    }
)

# Build a tree with 13 nodes: the root, 3 children, and 9 grandchildren.
dt = DataTree()
for first_level in [1, 2, 3]:
    dt[f"{first_level}"] = DataTree(ds)
    for second_level in [1, 2, 3]:
        dt[f"{first_level}/{second_level}"] = DataTree(ds)

# Local write
%time dt.to_zarr("test.zarr", mode="w")

# Remote write
bucket = "s3|gs://your-bucket/path"
%time dt.to_zarr(f"{bucket}/test.zarr", mode="w")

Gives (local write first, then remote):
CPU times: user 53.8 ms, sys: 3.95 ms, total: 57.8 ms
Wall time: 58 ms
CPU times: user 6.33 s, sys: 211 ms, total: 6.54 s
Wall time: 3min 20s
I suspect one of the culprits may be that we're having to reopen the store without consolidated metadata on writing each node:
Lines 205 to 223 in 433f78d
for node in dt.subtree:
    ds = node.ds
    group_path = node.path
    if ds is None:
        _create_empty_zarr_group(store, group_path, mode)
    else:
        ds.to_zarr(
            store,
            group=group_path,
            mode=mode,
            encoding=encoding.get(node.path),
            consolidated=False,
            **kwargs,
        )
    if "w" in mode:
        mode = "a"

if consolidated:
    consolidate_metadata(store)
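One way to test this suspicion might be to count how many operations hit the store while the tree is written. Below is a minimal sketch of a counting wrapper. `CountingStore` is a hypothetical helper written for illustration and is not part of datatree, xarray, or zarr; the demo at the bottom only exercises it with a plain dict and made-up keys.

```python
from collections import Counter
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Hypothetical helper: wrap a dict-like store and tally each operation."""

    def __init__(self, inner=None):
        # Fall back to an in-memory dict when no store is given.
        self.inner = {} if inner is None else inner
        self.counts = Counter()

    def __getitem__(self, key):
        self.counts["get"] += 1
        return self.inner[key]

    def __setitem__(self, key, value):
        self.counts["set"] += 1
        self.inner[key] = value

    def __delitem__(self, key):
        self.counts["delete"] += 1
        del self.inner[key]

    def __contains__(self, key):
        self.counts["contains"] += 1
        return key in self.inner

    def __iter__(self):
        # Iterating over keys corresponds to a listing on a remote store.
        self.counts["list"] += 1
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


# Small demo with illustrative keys (not produced by a real zarr write).
store = CountingStore()
store["a/.zgroup"] = b"{}"
_ = store["a/.zgroup"]
_ = "b/.zgroup" in store
print(dict(store.counts))  # {'set': 1, 'get': 1, 'contains': 1}
```

Assuming a zarr version that accepts `MutableMapping` stores, an instance could be passed as the `store` argument of `dt.to_zarr` in place of the path. If the read and existence-check counts grow with every node, that would support the idea that the cost is dominated by round trips rather than by the data itself.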
Any ideas for easy improvements here?