Description
What happened?
I am working with a huge dataset stored in HDF5 and loaded in chunks. Each chunk contains a part of my data that should be written to a specific region of a Zarr store, and the original order of the chunks must be preserved. Using map_blocks for this purpose seemed convenient. However, the final Zarr store has missing values: some chunks, or parts of chunks, are never written.
The code below documents this behaviour with a simplified scenario: a Zarr store initialized with zeros is supposed to be completely overwritten with ones, yet some parts always remain zero.
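In essence, the pattern looks like the sketch below (names such as store, chunk_size, dataset and the chunk_id variable are placeholders; the full reproducible example follows in the MVCE section):
def write_block(block):
    # Each dask block writes itself into a disjoint region of an existing zarr store
    i = int(block["chunk_id"][0])
    region = {"dim_0": slice(i * chunk_size, (i + 1) * chunk_size)}
    block.drop_vars("chunk_id").to_zarr(store, region=region)
    return block

dataset.map_blocks(write_block, template=dataset).compute()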
What did you expect to happen?
I expected the final Zarr store to be identical to ones_dataset, i.e. filled entirely with ones, with no regions left at their initial zeros.
Minimal Complete Verifiable Example
import os
import shutil
import xarray as xr
import numpy as np
import dask.array as da
xr.show_versions()
zarr_file = "file.zarr"
if os.path.exists(zarr_file):
    shutil.rmtree(zarr_file)
chunk_size = 5
shape = (50, 32, 1000)
ones_dataset = xr.Dataset({"data": xr.ones_like(xr.DataArray(np.empty(shape)))})
ones_dataset = ones_dataset.chunk({'dim_0': chunk_size})
chunk_indices = np.arange(len(ones_dataset.chunks['dim_0']))
chunk_ids = np.repeat(np.arange(ones_dataset.sizes["dim_0"] // chunk_size), chunk_size)
chunk_ids_dask_array = da.from_array(chunk_ids, chunks=(chunk_size,))
# Append the chunk IDs Dask array as a new variable to the existing dataset
ones_dataset['chunk_id'] = (('dim_0',), chunk_ids_dask_array)
# Create a new dataset filled with zeros
zeros_dataset = xr.Dataset({"data": xr.zeros_like(xr.DataArray(np.empty(shape)))})
zeros_dataset.to_zarr(zarr_file, compute=False)  # compute=False writes only the metadata, not the zero values
def process_chunk(chunk_dataset):
    # Work out which region of the zarr store this block belongs to
    chunk_id = int(chunk_dataset["chunk_id"][0])
    chunk_dataset_to_store = chunk_dataset.drop_vars("chunk_id")
    start_index = chunk_id * chunk_size
    end_index = start_index + chunk_size
    # Write the block into its region of the pre-initialized store
    chunk_dataset_to_store.to_zarr(zarr_file, region={'dim_0': slice(start_index, end_index)})
    return chunk_dataset
ones_dataset.map_blocks(process_chunk, template=ones_dataset).compute()
# Load data stored in zarr
zarr_data = xr.open_zarr(zarr_file, chunks={'dim_0': chunk_size})
# Find differences
for var_name in zarr_data.variables:
    try:
        xr.testing.assert_equal(zarr_data[var_name], ones_dataset[var_name])
    except AssertionError:
        print(f"Differences in {var_name}:")
        print(zarr_data[var_name].values)
        print(ones_dataset[var_name].values)
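As a quick way to quantify which regions were skipped, the remaining zeros can be counted per dim_0 block (a diagnostic sketch added for illustration, not part of the original report; it reuses zarr_data, shape and chunk_size from above):
# Count how many values are still zero in each dim_0 block of the stored data
missing = (zarr_data["data"] == 0).sum(dim=("dim_1", "dim_2")).compute()
for i in range(shape[0] // chunk_size):
    block_zeros = int(missing.isel(dim_0=slice(i * chunk_size, (i + 1) * chunk_size)).sum())
    print(f"block {i}: {block_zeros} values still zero")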
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
No response
Anything else we need to know?
No response
Environment
xarray: 2024.1.1
pandas: 2.1.4
numpy: 1.26.3
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.7
dask: 2024.1.1
distributed: 2024.1.0
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: 0.13.1
numbagg: 0.6.8
fsspec: 2023.12.2
cupy: None
pint: None
sparse: None
flox: 0.8.9
numpy_groupies: 0.10.2
setuptools: 69.0.2
pip: 23.3.1
conda: None
pytest: 7.4.4
mypy: None
IPython: None
sphinx: None