
Dataset.chunk() does not overwrite encoding["chunks"] #8062

Closed
Metamess opened this issue Aug 10, 2023 · 5 comments

Labels
plan to close May be closeable, needs more eyeballs

Comments

@Metamess
Contributor

What happened?

When using the chunk function to change the chunk sizes of a Dataset (or DataArray, which uses the Dataset implementation of chunk), the chunk sizes of the Dask arrays are changed, but the "chunks" entry of each variable's encoding attribute is not updated accordingly. This causes a NotImplementedError to be raised when attempting to write the Dataset to a zarr store (and presumably other formats as well).

Looking at the implementation of chunk, every variable is rechunked using the _maybe_chunk function, which actually has a parameter, overwrite_encoded_chunks, to control exactly this behavior. However, it is an optional parameter which defaults to False, and the call in chunk neither provides a value for it nor offers the caller any way to influence it (for example by exposing an overwrite_encoded_chunks parameter itself).

I do not know why this default value was chosen as False, or what could break if it were changed to True, but it seems to be the opposite of what the documentation describes. From the documentation of to_zarr:

Zarr chunks are determined in the following way:
From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

Which is exactly what it does not do.

What did you expect to happen?

I would expect the "chunks" entry of the encoding attribute to be changed to reflect the new chunking scheme.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# Create a test Dataset with dimension x and y, each of size 100, and a chunksize of 50
ds_original = xr.Dataset({"my_var": (["x", "y"], np.random.randn(100, 100))})
# Since 'chunk' does not work, manually set encoding
ds_original.my_var.encoding["chunks"] = (50, 50)
# To best showcase the real-life example, write it to file and read it back again.
# The same could be achieved by just calling .chunk() with chunksizes of 25, but this feels more 'complete'
filepath = "~/chunk_test.zarr"
ds_original.to_zarr(filepath)
ds = xr.open_zarr(filepath)

# Check the chunksizes and "chunks" encoding
print(ds.my_var.chunks)
# >>> ((50, 50), (50, 50))
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Rechunk the Dataset
ds = ds.chunk({"x": 25, "y": 25})
# The chunksizes have changed
print(ds.my_var.chunks)
# >>> ((25, 25, 25, 25), (25, 25, 25, 25))
# But the encoding value remains the same
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Attempting to write this back to zarr raises an error
ds.to_zarr("~/chunk_test_rechunked.zarr")
# NotImplementedError: Specified zarr chunks encoding['chunks']=(50, 50) for variable named 'my_var' would overlap multiple dask chunks ((25, 25, 25, 25), (25, 25, 25, 25)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
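
The error message itself points at a way around this: delete or modify encoding["chunks"], or pass safe_chunks=False. A minimal sketch of the first option (not part of the original report), continuing from the ds above:

# Workaround sketch: drop the stale "chunks" entry from every variable so
# to_zarr derives the on-disk chunking from the current dask chunks instead.
for var in ds.variables.values():
    var.encoding.pop("chunks", None)
ds.to_zarr("~/chunk_test_rechunked.zarr")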

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.16.3-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.7
libnetcdf: 4.8.1

xarray: 2023.7.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.12.0
h5py: 3.6.0
Nio: None
zarr: 2.14.1
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.6
dask: 2022.01.0+dfsg
distributed: 2022.01.0+ds.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.2.1
conda: None
pytest: 7.2.2
mypy: 1.1.1
IPython: 7.31.1
sphinx: None

Metamess added the bug and needs triage (Issue that has not been reviewed by xarray team member) labels on Aug 10, 2023
@TomNicholas
Member

I suspect this may be a regression introduced by #7019

@Metamess
Contributor Author

I have attempted to write a fix for this, see PR #8069, but some failing tests in test_backends seem to point to further issues regarding the setting of the encoding attribute: the test xarray/tests/test_backends.py::TestZarrDictStore::test_chunk_encoding_with_dask seems to directly assume that .chunk() does not set any encoding, and I did not want to overrule that assumption without further input.

@dcherian
Contributor

Yes, we do not want to update encoding. We are planning on getting rid of it soon (#6323 (comment)); you should be able to use .reset_encoding() to get rid of the errors.

From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

We should delete that reference to Dataset.chunk
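
For reference, a minimal sketch of this suggested workaround (an illustration, not code from the thread; in later xarray releases reset_encoding was renamed to drop_encoding):

# Clear ALL encoding after rechunking, so to_zarr falls back to the dask chunks.
ds = ds.chunk({"x": 25, "y": 25}).reset_encoding()
ds.to_zarr("~/chunk_test_rechunked.zarr")

Note that this drops every encoding entry, not just "chunks", which is exactly the concern raised in the comment below.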

dcherian added the plan to close (May be closeable, needs more eyeballs) label and removed the needs triage and bug labels on Aug 14, 2023
@Metamess
Contributor Author

Metamess commented Aug 14, 2023

Ah, I see. I was not aware of the intention to deprecate the encoding attribute. However, I think the current state of the .chunk() function is still undesirable, as it leaves the Dataset in a 'broken' state.

I can think of 4 solutions to the current situation, and I am willing to edit my PR to fit whatever resolution is deemed best:

  • Set encoding["chunks"] as part of .chunk(). When the encoding attribute gets removed from the codebase, this behavior will be taken out again. Until then it would appear (to me, at least) as the 'expected' result of the .chunk() function, given the existence of encoding. (This is the current intent in my PR)
  • Only update encoding["chunks"] as part of .chunk() if it is already present on the variable. This would effectively only "fix" the encoding attribute where it exists. No encoding is added, but a Dataset with configured encoding["chunks"] does not suddenly lose it after passing through .chunk(). (Feels a bit inconsistent to me)
  • Remove any encoding["chunks"] entries when .chunk() is called. Provides a 'clean slate', where chunking is now defined by the new chunk sizes of the Dask arrays. Any existing values as part of encoding are made obsolete by calling .chunk() and are thus removed. Would work towards a future without encoding in terms of program flow. (This would be my preferred solution, given my current understanding of the state of things)
  • All encoding is removed when .chunk() is called, possibly by adding a call to .reset_encoding() to the function. Aggressively moves towards a future without encoding. Has the (possibly significant) downside of removing any other properties stored in encoding, which would probably be unexpected behavior for users. (I would not be in favor of this)

My problem with the last option is also why calling .reset_encoding() after a call to .chunk() is undesirable to me, as in my use case there are more properties stored in encoding that I do not want to lose when changing the chunk sizes.
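
As an illustration of that narrower approach (a hypothetical helper, not part of xarray or of PR #8069): rechunk, then drop only the now-stale chunk-related encoding entries so the rest of the encoding survives.

# Hypothetical helper, not part of xarray: rechunk, then remove only the
# stale chunk-related encoding entries, keeping compression, dtype, etc.
def rechunk_keep_encoding(ds, chunks):
    ds = ds.chunk(chunks)
    for var in ds.variables.values():
        var.encoding.pop("chunks", None)
        var.encoding.pop("preferred_chunks", None)  # set by the zarr backend, if present
    return ds

ds = rechunk_keep_encoding(ds, {"x": 25, "y": 25})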

Regarding the proposed future without the encoding attribute, with any encoding stored in a separate object: Consider an application where a Dataset is created in one part, then goes through several stages of processing, and is finally written to a file. It would be a pain to have to pass an encoding object alongside it, in every function, just so that the encoding is not lost along the way, while it is only required at the end during the write stage. I do not expect to move the needle on the overall decision with this comment, but I hope it can serve as an argument for why a built-in solution that does not simply fully clear encoding may have some merit.

I will try and update my PR as soon as possible after further input :)

RondeauG added a commit to Ouranosinc/xscen that referenced this issue Apr 11, 2024

### What kind of change does this PR introduce?

* `original_shape` and `chunksizes` don't play well together. This PR
makes sure that `original_shape` is always removed before saving a
dataset.
* Also (maybe new in the latest version of `xarray` and the `netcdf4` engine?), it appears that dropping `chunksizes` leads to unexpected behaviours, such as bloated file sizes and incorrect chunking on disk. Thus, the `chunksizes` encoding was made more explicit (see the sketch below).
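
A rough sketch of that kind of cleanup (a hypothetical illustration, not the actual xscen code), assuming a dask-backed Dataset ds written with the netcdf4 engine:

# Hypothetical illustration of the cleanup described above: drop the stale
# "original_shape" entry and spell out the target chunk sizes explicitly.
for var in ds.variables.values():
    var.encoding.pop("original_shape", None)
    if var.chunks is not None:
        # use the first (largest) dask chunk along each dimension
        var.encoding["chunksizes"] = tuple(c[0] for c in var.chunks)
ds.to_netcdf("output.nc", engine="netcdf4")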

### Does this PR introduce a breaking change?

* No.


### Other information:

Related Issues:
pydata/xarray#8385
pydata/xarray#8062
@max-sixty
Collaborator

Closing as unlikely to inspire change; please reopen if anyone disagrees.

max-sixty closed this as not planned (Won't fix, can't repro, duplicate, stale) on Aug 28, 2024