Skip to content

Zstd: Don't persist the checksum param if false #681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

normanrz
Copy link
Member

@normanrz normanrz commented Jan 6, 2025

Changes the zstd codec to not persist the checksum param, if it is set to False (the default). This is aimed at improving backwards compatibility.

See zarr-developers/zarr-python#2647 (comment)

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

Copy link

codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.92%. Comparing base (6c0ea0f) to head (93584af).
Report is 23 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #681   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files          62       62           
  Lines        2752     2756    +4     
=======================================
+ Hits         2750     2754    +4     
  Misses          2        2           
Files with missing lines Coverage Δ
numcodecs/tests/test_zstd.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@normanrz
Copy link
Member Author

normanrz commented Jan 6, 2025

@dstansby Do you know why the macos tests are failing? I don't have that issue on my local mac.

@normanrz normanrz marked this pull request as ready for review January 6, 2025 15:37
@normanrz normanrz requested a review from dstansby January 6, 2025 15:46
@normanrz normanrz self-assigned this Jan 6, 2025
@dstansby
Copy link
Contributor

dstansby commented Jan 6, 2025

It's the same failure at #679, so not related to either PR as far as I can see.

@dstansby
Copy link
Contributor

dstansby commented Jan 6, 2025

re. this PR, I'm 👎 to changing the way this parameter is serialised, because it makes the serialisation logic more complex and potentially unituitive.

I think the right fix to zarr-developers/zarr-python#2647 is special casing how zarr chooses to seralise the zstd configuration when reading/writing zarr v2 data.

@normanrz
Copy link
Member Author

normanrz commented Jan 6, 2025

I don't understand how special-casing in the client library (zarr-python) is any less complex than this small fix?

I don't know what numcodecs' policy for backwards compatibility is, but conditionally adding parameters seems like a good way of dealing with that.

@dstansby
Copy link
Contributor

dstansby commented Jan 6, 2025

Yes, complexity of the fix would be the same, but numcodecs is used by a much wider range of packages than just zarr-python. Because this change introduces non-intuitive behaviour (only serialising a compression parameter if it has a particular value) that is not done for any other codecs in numcodecs, it should be down to zarr-python to choose a different way of serialising if it wants to.

@normanrz
Copy link
Member Author

normanrz commented Jan 6, 2025

but numcodecs is used by a much wider range of packages than just zarr-python.

Is it really?

@dstansby
Copy link
Contributor

dstansby commented Jan 6, 2025

2900 repostories and 136 packages according to GitHub: https://github.com/zarr-developers/numcodecs/network/dependents

@normanrz
Copy link
Member Author

normanrz commented Jan 6, 2025

I suppose most of them have a transitive dependency through zarr.

Anyways. I don't care too strongly about this issue.

@dstansby
Copy link
Contributor

dstansby commented Jan 6, 2025

In that case shall we close this, and I can add a note to zarr-developers/zarr-python#2647 that a fix/change to the codec serialisation should go in zarr-python?

@normanrz
Copy link
Member Author

normanrz commented Jan 6, 2025

Closing in favor of zarr-developers/zarr-python#2655

@normanrz normanrz closed this Jan 6, 2025
@normanrz normanrz deleted the zstd-no-checksu branch January 6, 2025 16:40
@nhz2
Copy link

nhz2 commented Jun 4, 2025

Not having this change in numcodecs makes things more complicated for other Zarr v2 implementations. Ref: JuliaIO/Zarr.jl#193. IMO, not being able to use numcodecs as a source of truth for this makes compatibility testing trickier than it has to be.

Also, since zarr v2 data with "checksum": false already exists in the wild (and are still produced by zarr-python 2.18.7), surely TensorStore needs to be updated to handle parsing compressor configs in a more flexible way.

@d-v-b
Copy link
Contributor

d-v-b commented Jun 4, 2025

IMO, not being able to use numcodecs as a source of truth for this makes compatibility testing trickier than it has to be.

the current behavior of numcodecs should not be a source of truth for the broader zarr community. that might have been sustainable several years ago when there were fewer zarr implementations, but today is different and we need to approach this problem outside numcodecs.

to take this particular example, I don't think it's reasonable to expect the implementers of zarr v2 in different programming languages to watch for breaking changes lurking in numcodecs releases.

instead of relying on the behavior of numcodecs, we should use specification documents, even (or especially) written retroactively. We have a repo for this: https://github.com/zarr-developers/zarr-extensions/ it doesn't have any zarr v2 stuff there, but IMO it should.

@nhz2
Copy link

nhz2 commented Jun 4, 2025

Yes, having more explicit documentation is great.

Also, I'm confused why adding a new optional config property can be considered a breaking change.

Zarr.jl will ignore any compressor properties it doesn't know about, so it would not be broken by this kind of change.

@d-v-b
Copy link
Contributor

d-v-b commented Jun 6, 2025

Yes, having more explicit documentation is great.

Also, I'm confused why adding a new optional config property can be considered a breaking change.

Zarr.jl will ignore any compressor properties it doesn't know about, so it would not be broken by this kind of change.

Numcodecs does not define a data model that distinguishes optional from non-optional keys in JSON documents. When your software consumes a .zarray that contains blosc metadata with a new key, your software cannot tell if that key is optional or not. So while you may be avoiding one kind of error by ignoring unknown keys that happen to be optional, you are inviting a second kind of error if those unknown keys are required.

Suppose numcodecs adds a new, required field to the blosc JSON metadata. Your software will silently ignore this field, and this will very likely result in an obscure error or undefined behavior when a user tries to read or write data to that array.

Thinking beyond this particular issue, it's worth remembering that the "blosc codec" is little more than 2 separate functions -- an encoding function, and a decoding function. We have chosen to glue these two functions together into a single object, which we call a "codec", but in reality the read and write functions might take different sets of parameters, and this could be reflected in the metadata explicitly, via something like this:

{
  "name": "blosc",
  "configuration": {"encode": {...}, "decode": {...}}
}

If your application was accessing data in a read-only mode, then it could safely ignore entirely the contents of blosc.configuration.encode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants