Skip to content

ZEP9: Parse Metadata Objects #2866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

brokkoli71
Copy link
Member

@brokkoli71 brokkoli71 commented Feb 26, 2025

This PR implements the following aspects of ZEP9 (Phase 1)

  • If metadata JSON contains invalid keys, or if a value object contains invalid keys, the zarr array should be rejected unless it contains `"must_understand": false``.
  • It should be possible to specify a value in metadata in the following ways
    • as a JSON object
    • As a string value (only if the object would have no attributes other than its name).

this will help to enable zarr extensions

TODOs in code:

  • read metadata fails accordingly (+test)
  • specifying metadata value works as string (+test)
    • in particular bytes codec should not be possible to specify as string value if it requires the "endian" argument for multi-byte data types

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 26, 2025
@brokkoli71 brokkoli71 changed the title prepare ZEP9: Parse Metadata Objects Feb 26, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 9, 2025
@brokkoli71 brokkoli71 marked this pull request as ready for review April 9, 2025 14:23
Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good!

Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@d-v-b
Copy link
Contributor

d-v-b commented May 13, 2025

what are the current use cases for this?

@normanrz
Copy link
Member

what are the current use cases for this?

There are a number of extensions appearing now in zarr-extensions. Potentially, there will also be modifications to existing extensions such as added attributes in the metadata. zarr-python should fail if it encounters unknown extensions or attributes, unless marked with must_understand=false. This PR helps with that.

@d-v-b
Copy link
Contributor

d-v-b commented May 16, 2025

as I understand it, all of the extensions proposed in zarr-extensions have static JSON representations. Is this not true?

And this PR seems to allow the possibility that any one of those JSON representations might gain new fields, which should be ignored iff those fields are JSON objects containing the {"must_understand" : False}. That seems kind weird to me -- why would we allow an entirely optional subset of, e.g., the gzip codec JSON? I was under the impression that for codecs and data types, the JSON representation contained everything you need to know to make sense of the data type. I don't see how optional, ignorable fields fit into that model.

@normanrz
Copy link
Member

There may be fields that are not strictly necessary for reading data that could be marked as optional. An example might be a "chunk_layout" in the sharding codec to denote how the chunks are ordered in the shard, e.g. morton, c, random etc.. While useful when writing, it is not necessary for reading because all chunk offsets are stored in the index.

Additionally, there may be new optional fields that are added to the root of the array or group metadata through a ZEP.

@d-v-b
Copy link
Contributor

d-v-b commented May 16, 2025

There may be fields that are not strictly necessary for reading data that could be marked as optional. An example might be a "chunk_layout" in the sharding codec to denote how the chunks are ordered in the shard, e.g. morton, c, random etc.. While useful when writing, it is not necessary for reading because all chunk offsets are stored in the index.

Additionally, there may be new optional fields that are added to the root of the array or group metadata through a ZEP.

these examples are not yet in use, which is why I asked what the current use cases are. This PR makes changes to how metadata is parsed (e.g., checking the contents of gzip json metadata) that, as far as I can tell, have no use.

Currently, I think people can expect that zarr-python can round-trip zarr data. To me, that means that if zarr-python can read zarr data from one place, it should be able to create a structurally identical copy of that zarr data somewhere else. The concept in this PR -- that we would support extra metadata fields in any metadata object which can be ignored when reading -- violates this expectation. So I think we need to have a larger conversation about what these hypothetical optional metadata fields mean for zarr-python before we add support for them. Until there are real examples out there of metadata with these must_understand=False fields, I don't feel like I can properly evaluate this PR.

@normanrz
Copy link
Member

Fair enough. Then we should scope this PR to

  • enforcing that unknown fields lead to an error
  • accepting both name-only and object notations of extensions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants