ZEP9: Parse Metadata Objects #2866

brokkoli71 · 2025-02-26T10:36:17Z

This PR implements the following aspects of ZEP9 (Phase 1):

If the metadata JSON contains invalid keys, or if a value object contains invalid keys, the zarr array should be rejected unless the value of each such key is an object containing `"must_understand": false`.
It should be possible to specify a value in metadata in the following ways:
- as a JSON object
- as a string value (only if the object would have no attributes other than its name)

This will help to enable zarr extensions.
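The two rules above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `is_ignorable`, `check_keys`, `parse_named_config` and `KNOWN_OBJECT_KEYS` are hypothetical names, and the sketch assumes an extension object carries only `name` and `configuration`.

```python
from typing import Any, Union

# Hypothetical: the keys an extension object may carry in this sketch.
KNOWN_OBJECT_KEYS = {"name", "configuration"}


def is_ignorable(value: Any) -> bool:
    """An unknown field may be skipped only if it is an object with must_understand == false."""
    return isinstance(value, dict) and value.get("must_understand") is False


def check_keys(obj: dict, known_keys: set) -> dict:
    """Reject unknown keys unless they are marked ignorable; return only the known fields."""
    unknown = sorted(
        k for k, v in obj.items() if k not in known_keys and not is_ignorable(v)
    )
    if unknown:
        raise ValueError(f"Unexpected keys: {unknown}")
    return {k: v for k, v in obj.items() if k in known_keys}


def parse_named_config(value: Union[str, dict]) -> tuple:
    """Accept the string form ("gzip") or the object form ({"name": ..., "configuration": ...})."""
    if isinstance(value, str):
        # Name-only notation: equivalent to an object with no configuration.
        return value, {}
    known = check_keys(value, KNOWN_OBJECT_KEYS)
    return known["name"], dict(known.get("configuration", {}))
```

With this sketch, `parse_named_config("crc32c")` and `parse_named_config({"name": "crc32c"})` give the same result, while an object with an extra key such as `{"name": "crc32c", "foo": 1}` raises `ValueError`.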

TODOs in code:

- read metadata fails accordingly (+test)
- specifying metadata value works as string (+test)
  - in particular, the bytes codec should not be specifiable as a string value if it requires the "endian" argument for multi-byte data types
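A minimal sketch of the bytes-codec constraint, assuming the data type's item size in bytes is known when the codec is parsed. `parse_bytes_endian` is a hypothetical helper, not the function used in zarr-python:

```python
from typing import Optional, Union


def parse_bytes_endian(value: Union[str, dict], item_size: int) -> Optional[str]:
    """Return the endianness for a bytes codec given in string or object form.

    item_size is the size of one array element in bytes (an assumption of this sketch).
    """
    if isinstance(value, str):
        name, config = value, {}
    else:
        name, config = value.get("name"), value.get("configuration", {})
    if name != "bytes":
        raise ValueError(f"Expected the bytes codec, got {name!r}")

    endian = config.get("endian")
    if endian is not None and endian not in ("little", "big"):
        raise ValueError(f"Invalid endian value: {endian!r}")
    # Multi-byte data types need an explicit byte order, so the
    # name-only form "bytes" is only acceptable for single-byte types.
    if item_size > 1 and endian is None:
        raise ValueError("The bytes codec requires 'endian' for multi-byte data types")
    return endian
```

Here `parse_bytes_endian("bytes", 1)` is accepted, while `parse_bytes_endian("bytes", 4)` raises because the string form cannot carry the required `endian` value.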

TODO:

- Add unit tests and/or doctests in docstrings
- Add docstrings and API docs for any new/modified user-facing classes and functions
- New/modified features documented in docs/user-guide/*.rst
- Changes documented as a new file in changes/
- GitHub Actions have all passed
- Test coverage is 100% (Codecov passes)

normanrz

This looks really good!

src/zarr/codecs/blosc.py

src/zarr/core/metadata/v3.py

normanrz

Thanks!

d-v-b · 2025-05-13T12:18:19Z

what are the current use cases for this?

normanrz · 2025-05-16T07:49:03Z

> what are the current use cases for this?

There are a number of extensions appearing now in zarr-extensions. Potentially, there will also be modifications to existing extensions such as added attributes in the metadata. zarr-python should fail if it encounters unknown extensions or attributes, unless marked with must_understand=false. This PR helps with that.

d-v-b · 2025-05-16T07:55:12Z

as I understand it, all of the extensions proposed in zarr-extensions have static JSON representations. Is this not true?

And this PR seems to allow the possibility that any one of those JSON representations might gain new fields, which should be ignored iff those fields are JSON objects containing `{"must_understand": false}`. That seems kind of weird to me -- why would we allow an entirely optional subset of, e.g., the gzip codec JSON? I was under the impression that for codecs and data types, the JSON representation contained everything you need to know to make sense of the data type. I don't see how optional, ignorable fields fit into that model.

normanrz · 2025-05-16T08:09:36Z

There may be fields that are not strictly necessary for reading data that could be marked as optional. An example might be a "chunk_layout" in the sharding codec to denote how the chunks are ordered in the shard, e.g. morton, c, random, etc. While useful when writing, it is not necessary for reading because all chunk offsets are stored in the index.

Additionally, there may be new optional fields that are added to the root of the array or group metadata through a ZEP.
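To make the idea concrete, here is a sketch of what such a field might look like. The `chunk_layout` field and its `order` value are hypothetical (no such field exists in the sharding codec today), and `split_optional` is an illustrative helper only:

```python
# Configuration keys of the sharding codec that a reader needs.
SHARDING_KEYS = {"chunk_shape", "codecs", "index_codecs", "index_location"}

sharding_configuration = {
    "chunk_shape": [32, 32],
    "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
    "index_codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "crc32c"},
    ],
    "index_location": "end",
    # Hypothetical optional field: a hint for writers that readers may skip.
    "chunk_layout": {"must_understand": False, "order": "morton"},
}


def split_optional(configuration: dict, known_keys: set) -> tuple:
    """Separate the fields a reader needs from fields it is allowed to ignore."""
    required = {k: v for k, v in configuration.items() if k in known_keys}
    ignored = {}
    for key, value in configuration.items():
        if key in known_keys:
            continue
        if isinstance(value, dict) and value.get("must_understand") is False:
            ignored[key] = value
        else:
            # Unknown and not marked as ignorable: the reader must fail.
            raise ValueError(f"Unknown configuration key: {key!r}")
    return required, ignored


required, ignored = split_optional(sharding_configuration, SHARDING_KEYS)
```

A reader would use only `required`; the chunk offsets still come from the shard index, so dropping `ignored` does not affect reading.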

d-v-b · 2025-05-16T08:23:34Z

> There may be fields that are not strictly necessary for reading data that could be marked as optional. An example might be a "chunk_layout" in the sharding codec to denote how the chunks are ordered in the shard, e.g. morton, c, random, etc. While useful when writing, it is not necessary for reading because all chunk offsets are stored in the index.
>
> Additionally, there may be new optional fields that are added to the root of the array or group metadata through a ZEP.

These examples are not yet in use, which is why I asked what the current use cases are. This PR makes changes to how metadata is parsed (e.g., checking the contents of gzip JSON metadata) that, as far as I can tell, have no use.

Currently, I think people can expect that zarr-python can round-trip zarr data. To me, that means that if zarr-python can read zarr data from one place, it should be able to create a structurally identical copy of that zarr data somewhere else. The concept in this PR -- that we would support extra metadata fields in any metadata object which can be ignored when reading -- violates this expectation. So I think we need to have a larger conversation about what these hypothetical optional metadata fields mean for zarr-python before we add support for them. Until there are real examples out there of metadata with these must_understand=False fields, I don't feel like I can properly evaluate this PR.
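The round-trip concern can be illustrated with a small sketch. It assumes a reader that silently drops ignorable fields and a writer that serializes only what the reader kept; `read_then_write` and the `hint` field are hypothetical:

```python
import json

KNOWN_KEYS = {"name", "configuration"}


def read_then_write(metadata_json: str) -> str:
    """Parse metadata, keep only the fields the reader understands, and serialize again."""
    document = json.loads(metadata_json)
    kept = {k: v for k, v in document.items() if k in KNOWN_KEYS}
    return json.dumps(kept, sort_keys=True)


original = json.dumps(
    {
        "name": "gzip",
        "configuration": {"level": 5},
        # Hypothetical ignorable field.
        "hint": {"must_understand": False, "note": "written by tool X"},
    },
    sort_keys=True,
)
copy = read_then_write(original)

# The copy is readable, but it is no longer structurally identical to the original.
print(copy == original)
```

Under these assumptions the copy loses the `hint` field, which is the sense in which ignoring optional fields on read conflicts with creating a structurally identical copy.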

normanrz · 2025-05-16T08:27:46Z

Fair enough. Then we should scope this PR to:

- enforcing that unknown fields lead to an error
- accepting both name-only and object notations of extensions

prepare

d3b19cf

github-actions bot added the "needs release notes" label Feb 26, 2025

Merge branch 'main' into zarr-extensions

50cd5e0

brokkoli71 changed the title ~~prepare~~ ZEP9: Parse Metadata Objects Feb 26, 2025

brokkoli71 and others added 16 commits February 28, 2025 12:18

check for unexpected zarr metadata keys and codec configuration

786669c

format

26b658a

Merge branch 'main' into zarr-extensions

1e50587

Merge branch 'main' into zarr-extensions

4967003

if data type has endianness, then codecs must specify endian attribute

72a28e2

codec.from_dict does not select endian automatically

ac9f8d5

Merge branch 'main' into zarr-extensions

36c4d33

fix for single byte data types

54c13a0

fix test_fail_on_invalid_key

d46176e

add testcase for test_codec_requires_endian

a38f25e

metadata: unknown configuration keys will get rejected except must_understand=False

0c82437

codecs: unknown configuration keys will get rejected except must_understand=False

de8f5b1

fix test_special_float_fill_values

151796f

fix kwargs typing

0353ae9

objects for datatype, chunk_key_encodings, chunk_grid

08fa7f5

document changes

0cef5e6

github-actions bot removed the "needs release notes" label Apr 9, 2025

Merge branch 'main' into zarr-extensions

5ad18ec

brokkoli71 marked this pull request as ready for review April 9, 2025 14:23

normanrz reviewed Apr 9, 2025

src/zarr/codecs/blosc.py (outdated, resolved)

src/zarr/core/metadata/v3.py (outdated, resolved)

brokkoli71 added 4 commits April 10, 2025 15:07

extract helper reject_must_understand_metadata

385b7ca

Merge remote-tracking branch 'origin/zarr-extensions' into zarr-extensions

8812479

fix circular import

0d91dc6

set kwargs type

833f408

normanrz approved these changes Apr 10, 2025

fix test_fail_on_invalid_metadata_key

bc80e51

brokkoli71 added 4 commits April 10, 2025 16:34

Merge branch 'main' into zarr-extensions

c801088

Merge branch 'main' into zarr-extensions

c9f6ae4

Merge branch 'main' into zarr-extensions

9048e03

Merge branch 'main' into zarr-extensions

2f67624

brokkoli71 marked this pull request as draft May 27, 2025 11:42

d-v-b mentioned this pull request Jul 1, 2025

support plain string form for codecs, data types, etc #3188

Open
