Skip to content

Ambiguous behavior with coordinates when appending to Zarr store with append_dim #8427

Closed
@rabernat

Description

@rabernat

What happened?

There are two quite different scenarios covered by "append" with Zarr

  • Adding new variables to a dataset
  • Extending arrays along a dimensions (via append_dim)

This issue is about what should happen when using append_dim with variables that do not contain append_dim.

Here's the current behavior.

import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [1], 'y': [2]},
    name="foo"
).to_dataset()

ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x':[-1], 'y': [-2]},
    name="foo"
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)
# we do not check that the coordinates are aligned--just that they have the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# coordinates data have been overwritten 
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}
# ...with the latest values
assert ds_append.x.data[0] == -1

Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

What did you expect to happen?

There are a couple of different options we could consider for how to handle this "extending" situation (with append_dim)

  1. [current behavior] Do not attempt to align coordinates
    a. [current behavior] Overwrite coordinates with new data
    b. Keep original coordinates
    c. Force the user to explicitly drop the coordinates, as we do for region operations.
  2. Attempt to align coordinates
    a. Fail if coordinates don't match
    b. Extend the arrays to replicate the behavior of concat

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.176-157.645.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.1
pandas: 2.1.2
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: 0.6.0
fsspec: 2023.10.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.16.1
sphinx: None

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions