Description
What happened?
There are two quite different scenarios covered by "append" with Zarr:
- Adding new variables to a dataset
- Extending arrays along a dimension (via `append_dim`)
This issue is about what should happen when using `append_dim` with variables that do not contain `append_dim`.
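For concreteness, here is a minimal sketch of the two scenarios (illustrative only; `foo` and `bar` are placeholder variable names, not part of the example below):

```python
import numpy as np
import xarray as xr
import zarr

store = zarr.storage.MemoryStore()
xr.Dataset({"foo": ("time", np.arange(3))}).to_zarr(store, consolidated=False)

# scenario 1: add a new variable to the existing store (mode="a")
xr.Dataset({"bar": ("time", np.zeros(3))}).to_zarr(store, mode="a", consolidated=False)

# scenario 2: extend the existing arrays along a dimension (append_dim)
xr.Dataset(
    {"foo": ("time", np.arange(3, 5)), "bar": ("time", np.zeros(2))}
).to_zarr(store, append_dim="time", consolidated=False)
```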
Here's the current behavior.
```python
import numpy as np
import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [1], 'y': [2]},
    name="foo",
).to_dataset()
ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [-1], 'y': [-2]},
    name="foo",
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)
# we do not check that the coordinates are aligned--just that
# they have the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# coordinate data have been overwritten...
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}
# ...with the latest values
assert ds_append.x.data[0] == -1
```
Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.
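One way to confirm that the overwrite happens in the store itself (not just in how xarray reads it back) is to inspect the underlying Zarr group directly; a minimal sketch, continuing from the example above:

```python
root = zarr.open_group(store, mode="r")
# the coordinate arrays in the store now hold the values from ds2
print(root["x"][:])  # [-1]
print(root["y"][:])  # [-2]
```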
What did you expect to happen?
There are a couple of different options we could consider for how to handle this "extending" situation (with `append_dim`):
1. Do not attempt to align coordinates
   a. [current behavior] Overwrite coordinates with new data
   b. Keep original coordinates
   c. Force the user to explicitly drop the coordinates, as we do for `region` operations
2. Attempt to align coordinates
   a. Fail if coordinates don't match
   b. Extend the arrays to replicate the behavior of `concat`
We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.
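For reference, the explicit-drop approach of option 1c is already possible today, and it produces the same stored result that option 1b would give by default; a minimal sketch, reusing `ds1` and `ds2` from the example above:

```python
store2 = zarr.storage.MemoryStore()
ds1.to_zarr(store2, consolidated=False)

# drop the coordinates that do not contain the append dimension
# before appending, so they are never rewritten
ds2.drop_vars(["x", "y"]).to_zarr(store2, append_dim="time", consolidated=False)

ds_append = xr.open_zarr(store2, consolidated=False)
assert ds_append.x.data[0] == 1  # original coordinate values preserved
```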
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.176-157.645.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2
xarray: 2023.10.1
pandas: 2.1.2
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: 0.6.0
fsspec: 2023.10.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.16.1
sphinx: None