-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset.chunk() does not overwrite encoding["chunks"] #8062
Comments
I suspect this may be a regression introduced by #7019 |
…ncoding attribute
I have attempted to write a fix for this, see PR #8069, but some failing tests in test_backends seem to point to some further issues regarding the setting of the encoding attribute: test |
Yes, we do not want to update encoding. We are planning on getting rid of it soon (#6323 (comment)) you should be able to use
We should delete that reference to |
Ah, I see. I was not aware of the intention to deprecate the encoding attribute. However, I think the current state of the I can think of 4 solutions to the current situation, and I am willing to edit my PR to fit whatever resolution is deemed best:
My problem with the last option is also why calling Regarding the proposed future without the I will try and update my PR as soon as possible after further input :) |
<!-- Please ensure the PR fulfills the following requirements! --> <!-- If this is your first PR, make sure to add your details to the AUTHORS.rst! --> ### Pull Request Checklist: - [x] This PR addresses an already opened issue (for bug fixes / features) - This PR fixes #xyz - [x] (If applicable) Documentation has been added / updated (for bug fixes / features). - [ ] (If applicable) Tests have been added. - [x] This PR does not seem to break the templates. - [x] CHANGES.rst has been updated (with summary of main changes). - [x] Link to issue (:issue:`number`) and pull request (:pull:`number`) has been added. ### What kind of change does this PR introduce? * `original_shape` and `chunksizes` don't play well together. This PR makes sure that `original_shape` is always removed before saving a dataset. * Also, (maybe new in the latest version of `xarray` and engine `netcdf4`?), it appears that dropping `chunksizes` leads to unexpected behaviours, such as bloated file size and incorrect chunking on disk. Thus, the `chunksizes` encoding was made more explicit. ### Does this PR introduce a breaking change? * No. ### Other information: Related Issues: pydata/xarray#8385 pydata/xarray#8062
Closing as unlikely to inspire change, please reopen if anyone disagrees |
What happened?
When using the
chunk
function to change the chunk sizes of a Dataset (or DataArray, which uses the Dataset implementation ofchunk
), the chunk sizes of the Dask arrays are changed, but the "chunks" entry of theencoding
attributes are not changed accordingly. This causes the raising of a NotImplementedError when attempting to write the Dataset to a zarr (and presumably other formats as well).Looking at the implementation of
chunk
, every variable is rechunked using the_maybe_chunk
function, which actually has the parameteroverwrite_encoded_chunks
to control just this behavior. However, it is an optional parameter which defaults to False, and the call inchunk
does not provide a value for this parameter, nor does it offer the caller to influence it (by having anoverwrite_encoded_chunks
parameter itself, for example).I do not know why this default value was chosen as False, or what could break if it was changed to True, but looking at the documentation, it seems the opposite of the intended effect. From the documentation of
to_zarr
:Which is exactly what it doesn't.
What did you expect to happen?
I would expect the "chunks" entry of the
encoding
attribute to be changed to reflect the new chunking scheme.Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.16.3-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.7
libnetcdf: 4.8.1
xarray: 2023.7.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.12.0
h5py: 3.6.0
Nio: None
zarr: 2.14.1
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.6
dask: 2022.01.0+dfsg
distributed: 2022.01.0+ds.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.2.1
conda: None
pytest: 7.2.2
mypy: 1.1.1
IPython: 7.31.1
sphinx: None
The text was updated successfully, but these errors were encountered: