Description
My assumption is that it should be possible to:
- Write to a zarr store with some chunk size along a dimension
- Load from that zarr store and rechunk to a multiple of that chunk size
- Write that result to another zarr store
However, I see this behavior instead:
```python
import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=xr.DataArray(da.random.random(size=100, chunks=10), dims='d1')
))
# Write the store
ds.to_zarr('/tmp/ds1.zarr', mode='w')
# Read it out, rechunk it, and attempt to write it again
xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')
```
```
ValueError: Final chunk of Zarr array must be the same size or smaller than the first.
Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20)
in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding.
Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.
```
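For what it's worth, the conflict appears to be between the chunk encoding carried over from the first store and the new dask chunks; here is a quick check of that state (my own sketch, assuming I'm reading the error correctly):

```python
# Inspect the state that (I assume) to_zarr sees when it raises:
ds1 = xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20))
print(ds1.x.encoding.get('chunks'))  # (10,)  <- carried over from ds1.zarr
print(ds1.x.chunks)                  # ((20, 20, 20, 20, 20),)  <- new dask chunks
```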
Full trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
   1656         compute=compute,
   1657         consolidated=consolidated,
-> 1658         append_dim=append_dim,
   1659     )
   1660

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
   1351     writer = ArrayWriter()
   1352     # TODO: figure out how to properly handle unlimited_dims
-> 1353     dump_to_store(dataset, zstore, writer, encoding=encoding)
   1354     writes = writer.sync(compute=compute)
   1355

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1126     variables, attrs = encoder(variables, attrs)
   1127
-> 1128     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1129
   1130

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    411     self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
    412     self.set_variables(
--> 413         variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
    414     )
    415

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    466         # new variable
    467         encoding = extract_zarr_variable_encoding(
--> 468             v, raise_on_invalid=check, name=vn
    469         )
    470         encoded_attrs = {}

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in extract_zarr_variable_encoding(variable, raise_on_invalid, name)
    214
    215     chunks = _determine_zarr_chunks(
--> 216         encoding.get("chunks"), variable.chunks, variable.ndim, name
    217     )
    218     encoding["chunks"] = chunks

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name)
    154     if dchunks[-1] > zchunk:
    155         raise ValueError(
--> 156             "Final chunk of Zarr array must be the same size or "
    157             "smaller than the first. "
    158             f"Specified Zarr chunk encoding['chunks']={enc_chunks_tuple}, "

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.
```
Overwriting chunks on `open_zarr` with `overwrite_encoded_chunks=True` works, but I don't want that because it requires providing a uniform chunk size for all variables (see the sketch after the workaround below). This workaround seems to be fine though:
```python
ds = xr.open_zarr('/tmp/ds1.zarr')
del ds.x.encoding['chunks']
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')
```
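For comparison, this is roughly what I understand the `overwrite_encoded_chunks` route to look like; using the `chunks` argument of `open_zarr` to supply the new size is my assumption about the intended usage:

```python
# Sketch of the overwrite_encoded_chunks alternative (not what I want):
# the chunk size given here applies to every variable along 'd1', and the
# stored encoding['chunks'] is replaced by the loading chunks.
ds = xr.open_zarr('/tmp/ds1.zarr', chunks={'d1': 20}, overwrite_encoded_chunks=True)
ds.to_zarr('/tmp/ds2.zarr', mode='w')
```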
Does `encoding['chunks']` serve any purpose after you've loaded a zarr store and all the variables are defined as dask arrays? In other words, is there any harm in deleting it from all dask variables if I want those variables to write back out to zarr using the dask chunk definitions instead?
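For completeness, the generalized version of the workaround I have in mind would be something like the sketch below, which simply strips the stored chunk encoding from every dask-backed variable before writing:

```python
import xarray as xr

ds = xr.open_zarr('/tmp/ds1.zarr')
# Drop the zarr chunk encoding carried over from the first store for every
# dask-backed variable, so to_zarr falls back to the dask chunk layout.
for var in ds.variables.values():
    if var.chunks is not None:
        var.encoding.pop('chunks', None)
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')
```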
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1