to_zarr silently loses data when using append_dim, if chunks are different to zarr store #8882

Closed
@harryC-space-intelligence

Description

What happened?

When writing a chunked DataArray to an existing zarr store, appending along an existing dimension of the store, I have found that some data are not written if several chunks of the array map to a single zarr chunk.

I appreciate it is probably bad practice to have different chunk sizes in my DataArray and zarr store, but I think it's a realistic scenario that needs to be caught.

This may be related to / the same underlying issue as #8371. Perhaps the checks mentioned in #8371 (comment) are somehow getting bypassed? Using zarr's ThreadSynchronizer is the only way I have found to ensure that all the data gets written.
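One plausible mechanism, offered as an assumption rather than a confirmed diagnosis: updating part of a stored chunk is a read-modify-write of the whole chunk, so two unsynchronised writers that target the same chunk can overwrite each other's update. The pure-Python sketch below simulates that interleaving. It does not call xarray or zarr; `store`, `read_chunk` and `write_chunk` are hypothetical stand-ins for a single stored chunk.

```python
# Hypothetical in-memory stand-in for one stored chunk of 5 values.
FILL = None
store = {"chunk0": [FILL] * 5}


def read_chunk(key):
    # A partial update first loads the whole stored chunk.
    return list(store[key])


def write_chunk(key, values):
    # The whole chunk is written back, replacing whatever is stored.
    store[key] = list(values)


# Two writers each own one element of the same stored chunk.
# Unsynchronised interleaving: both read before either writes.
copy_a = read_chunk("chunk0")
copy_b = read_chunk("chunk0")
copy_a[0] = 1.0
copy_b[1] = 1.0
write_chunk("chunk0", copy_a)
write_chunk("chunk0", copy_b)  # replaces writer A's result with a stale copy
unsynchronised = store["chunk0"]

# Serialised access, which is what a per-chunk lock would enforce:
# each read sees the result of the previous write.
store["chunk0"] = [FILL] * 5
for index in (0, 1):
    chunk = read_chunk("chunk0")
    chunk[index] = 1.0
    write_chunk("chunk0", chunk)
serialised = store["chunk0"]

print(unsynchronised)
print(serialised)
```

In the unsynchronised case writer A's element is lost; in the serialised case both survive. That would be consistent with a synchronizer making the problem disappear.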

What did you expect to happen?

I expected that either

  • to_zarr would recognise the different chunk sizes, and re-chunk or wait for all the chunks to be written
  • or an error would be raised, given that the write loses data in an unpredictable way
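The second option could look roughly like the check below. This is a sketch under stated assumptions, not xarray's implementation: `check_chunk_alignment` is a hypothetical helper that takes the chunk lengths of the array along one dimension (in the form found in `DataArray.chunks`) and the chunk length of the existing store, and raises if two array chunks would write to the same stored chunk.

```python
def check_chunk_alignment(dask_chunks, zarr_chunk, offset=0):
    """Raise ValueError if two array chunks would write to the same zarr chunk.

    dask_chunks: chunk lengths along one dimension, e.g. (1, 1, 1).
    zarr_chunk: chunk length of the existing store along that dimension.
    offset: index in the store at which the first array chunk starts.
    """
    previous_last = None  # last zarr chunk touched by the previous array chunk
    start = offset
    for length in dask_chunks:
        if length == 0:
            continue  # an empty chunk writes nothing
        stop = start + length
        first = start // zarr_chunk
        last = (stop - 1) // zarr_chunk
        if previous_last is not None and first == previous_last:
            raise ValueError(
                f"dask chunks overlap in zarr chunk {first}; "
                f"rechunk to a multiple of {zarr_chunk}"
            )
        previous_last = last
        start = stop


# Aligned chunks (5 and 5 against a store chunk of 5) pass silently.
check_chunk_alignment((5, 5), zarr_chunk=5)

# The case from this issue: chunks of 1 against a store chunk of 5.
try:
    check_chunk_alignment((1,) * 10, zarr_chunk=5)
except ValueError as err:
    message = str(err)
    print(message)
```

An array chunk that spans several whole zarr chunks is allowed here, because only one writer touches each stored chunk. The check would need to run once per dimension, with `offset` set to the current length of the store along the append dimension.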

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
from matplotlib import pyplot as plt

x_coords = np.arange(10)
y_coords = np.arange(10)
t_coords = np.array([np.datetime64('2020-01-01').astype('datetime64[ns]')])
data = np.ones((10, 10))

# Repeat the write/append cycle four times to compare the result between runs.
for i in range(4):
    plt.subplot(1, 4, i + 1)

    # Initial write: the store is created with 5x5 chunks in x and y.
    da = xr.DataArray(data.reshape((-1, 10, 10)),
                      dims=['time', 'x', 'y'],
                      coords={'x': x_coords, 'y': y_coords, 'time': t_coords},
                      ).chunk({'x': 5, 'y': 5, 'time': 1}).rename('foo')

    da.to_zarr('foo.zarr', mode='w')

    new_time = np.array([np.datetime64('2021-01-01').astype('datetime64[ns]')])

    # Appended array: 1x1 chunks, so 25 array chunks fall inside each zarr chunk.
    da2 = xr.DataArray(data.reshape((-1, 10, 10)),
                       dims=['time', 'x', 'y'],
                       coords={'x': x_coords, 'y': y_coords, 'time': new_time},
                       ).chunk({'x': 1, 'y': 1, 'time': 1}).rename('foo')

    da2.to_zarr('foo.zarr', append_dim='time', mode='a')

    # Every value in the appended time step should be 1.
    plt.imshow(xr.open_zarr('foo.zarr').isel(time=-1).foo.values)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

Output from the plots above:

[image: plots from the example above]

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1041-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.2.0
pandas: 2.2.1
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.17.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.3.1
distributed: 2024.3.1
matplotlib: 3.8.3
cartopy: 0.22.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: 0.23
sparse: 0.15.1
flox: 0.9.5
numpy_groupies: 0.10.2
setuptools: 69.2.0
pip: 24.0
conda: 24.1.2
pytest: 8.1.1
mypy: None
IPython: 8.22.2
sphinx: None

Labels

bug, topic-zarr (Related to zarr storage library)
