-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
What happened?
I am writing lots of geo data to disk in Zarr and Icechunk, usually through appending to a given Zarr store on disk. I noticed that recently, some of the data values have seemed flipped compared to what the dimension says they should be. I've made a minimal example to show this, the datasets appended in memory are correct, but the one appended to on disk has the level dimension updated to match the last appended timestep, resulting in previous timestep values to not match the dimensions.
What did you expect to happen?
I would expect when appending to a Zarr store on disk, Xarray would align the new data to append to match the current Zarr store's metadata. Or at least throw an error if they don't match, rather than silently updating the store's metadata.
Minimal Complete Verifiable Example
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "xarray[complete]@git+https://github.com/pydata/xarray.git@main",
# ]
# ///
#
# This script automatically imports the development branch of xarray to check for issues.
# Please delete this header if you have _not_ tested this script with `uv run`!
import xarray as xr
xr.show_versions()
import xarray as xr
import numpy as np
var1 = xr.DataArray(
coords={"time": [0,], "latitude": [0,1], "longitude": [0,1],
"level": np.arange(0, 5, 1)},
data=np.arange(0,20,1).reshape((1,2,2,5))
).to_dataset(name="var").chunk({"time": 1, "latitude": -1, "longitude": -1, "level": -1})
var1.to_zarr("test.zarr", compute=True, mode="w")
var1.to_zarr("test2.zarr", compute=True, mode="w")
var2 = xr.DataArray(
coords={"time": [1,], "latitude": [0,1], "longitude": [0,1],
"level": np.arange(4, -1, -1)},
data=np.arange(19,-1,-1).reshape((1,2,2,5))
).to_dataset(name="var").chunk({"time": 1, "latitude": -1, "longitude": -1, "level": -1})
var2.to_zarr("test.zarr", append_dim="time", compute=True, mode="a")
ds = xr.concat([var1, var2], dim="time")
zarr_ds = xr.open_zarr("test.zarr")
zarr_var1_ds = xr.open_zarr("test2.zarr")
print(ds)
print(zarr_ds)
print(zarr_var1_ds)
zarr_ds_var1 = zarr_ds.isel(time=[0,])
assert np.allclose(zarr_ds_var1["var"].values, zarr_var1_ds["var"].values)
xr.testing.assert_equal(ds, zarr_ds)Steps to reproduce
Run the above script.
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
/home/jacob/assimilation/.pixi/envs/gpu/lib/python3.12/site-packages/zarr/api/asynchronous.py:247: ZarrUserWarning:
Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
/home/jacob/assimilation/.pixi/envs/gpu/lib/python3.12/site-packages/zarr/api/asynchronous.py:247: ZarrUserWarning:
Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
/home/jacob/assimilation/check_zarr.py:22: FutureWarning:
In a future version of xarray the default value for join will change from join='outer' to join='exact'. This change will result in the following ValueError: cannot be aligned with join='exact' because index/labels/sizes are not equal along these coordinates (dimensions): 'level' ('level',) The recommendation is to set join explicitly for this case.
<xarray.Dataset> Size: 408B
Dimensions: (time: 2, latitude: 2, longitude: 2, level: 5)
Coordinates:
* time (time) int64 16B 0 1
* latitude (latitude) int64 16B 0 1
* longitude (longitude) int64 16B 0 1
* level (level) int64 40B 0 1 2 3 4
Data variables:
var (time, latitude, longitude, level) int64 320B dask.array<chunksize=(1, 2, 2, 5), meta=np.ndarray>
<xarray.Dataset> Size: 408B
Dimensions: (time: 2, latitude: 2, longitude: 2, level: 5)
Coordinates:
* time (time) int64 16B 0 1
* latitude (latitude) int64 16B 0 1
* longitude (longitude) int64 16B 0 1
* level (level) int64 40B 4 3 2 1 0
Data variables:
var (time, latitude, longitude, level) int64 320B dask.array<chunksize=(1, 2, 2, 5), meta=np.ndarray>
<xarray.Dataset> Size: 240B
Dimensions: (time: 1, latitude: 2, longitude: 2, level: 5)
Coordinates:
* time (time) int64 8B 0
* latitude (latitude) int64 16B 0 1
* longitude (longitude) int64 16B 0 1
* level (level) int64 40B 0 1 2 3 4
Data variables:
var (time, latitude, longitude, level) int64 160B dask.array<chunksize=(1, 2, 2, 5), meta=np.ndarray>
Traceback (most recent call last):
File "/home/jacob/assimilation/check_zarr.py", line 30, in <module>
xr.testing.assert_equal(ds, zarr_ds)
File "/home/jacob/assimilation/.pixi/envs/gpu/lib/python3.12/site-packages/xarray/testing/assertions.py", line 33, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/jacob/assimilation/.pixi/envs/gpu/lib/python3.12/site-packages/xarray/testing/assertions.py", line 147, in assert_equal
assert a.equals(b), formatting.diff_dataset_repr(a, b, "equals")
^^^^^^^^^^^
AssertionError: Left and right Dataset objects are not equal
Differing coordinates:
L * level (level) int64 40B 0 1 2 3 4
R * level (level) int64 40B 4 3 2 1 0
Differing data variables:
L var (time, latitude, longitude, level) int64 320B dask.array<chunksize=(1, 2, 2, 5), meta=np.ndarray>
R var (time, latitude, longitude, level) int64 320B dask.array<chunksize=(1, 2, 2, 5), meta=np.ndarray>Anything else we need to know?
No response