ZEP0004 Review - Zarr Conventions #262

MSanKeys963 · 2023-08-17T23:09:01Z

Hi everyone! 👋🏻

I did some preliminary work for ZEP0004 review, as mentioned here.

@rabernat, please have a look and let us know your thoughts. Thanks!

Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>

rabernat · 2023-08-18T00:44:57Z

Thanks so much @MSanKeys963 for getting this started. It's a perfect place to start. Here's what I will try to do over the next few days:

Port over some of the text from ZEP4 to the specs website
Create a convention template
Create at least one full example convention

In the meantime, we can use this PR to continue the discussion started in https://github.com/zarr-developers/zeps/pull/28/files where ZEP4 was first proposed.

@ivirshup I know you have lots of ideas here, and you have been very patient as this ZEP has moved forward very slowly. 🙏 I'd love to hear more about your use cases for conventions in anndata and the other projects you're involved with.

benbovy · 2023-09-21T07:59:48Z

Would it make sense to suggest a zarr convention providing a json-schema in addition to a document?

From what I read in zarr-developers/zeps#28 the goal of ZEP4 is to have a common place where to find information about how to store some domain-specific metadata in a standard-ish way rather than something that should be strictly enforced. So a convention document is more important than a json-schema and the latter should really be optional.

Having a json-schema may be nice to avoid making mistakes or misinterpretations while implementing a convention in a domain-specific library. Using existing tooling would make the process faster too I guess? I'm not familiar with json-schema, though, so I don't know if it is compatible with the modularity and flexibility of zarr conventions as proposed in ZEP4. Are json schemas easily composable?

tasansal · 2023-11-07T22:31:44Z

JSON schema makes sense to me, and I have implemented some in Pydantic for a different project. However, it gets a bit ugly when you start using hyphens for key names and symbols for namespaces as proposed in ZEP0004.

Stuff like this, i.e. most programming languages do not allow hyphens in variable names, so such keys need aliases. Luckily Pydantic supports this, but I'm not sure what would happen in other languages. Parsing can be difficult. It also gets very nested and confusing.

CoordinateUnits is a combination of DistanceUnits + more stuff.

DistanceUnit is a combination of Metric/Imperial length units etc. Below is "allowed" imperial length units in v1 as an enum (Unit is a StrEnum with a few convenience methods).

Any thoughts? The example above allows JSON specification like this:

{"units-v1": {"distance": "ft"}}
// or
{"units-v1": {"angle": "rad"}}

at the end of the day you end up with a schema like this, which is nice, but implementation makes me want to barf :)
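The aliasing problem described above can be sketched with the standard library alone. This is only an illustration, not the Pydantic models from the other project: the class names `ImperialLength`, `AngleUnit` and `UnitsV1`, the `ALIASES` table, and the particular unit members are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ImperialLength(str, Enum):
    """Hypothetical subset of 'allowed' imperial length units."""
    FOOT = "ft"
    INCH = "in"
    YARD = "yd"


class AngleUnit(str, Enum):
    """Hypothetical subset of allowed angle units."""
    RADIAN = "rad"
    DEGREE = "deg"


@dataclass(frozen=True)
class UnitsV1:
    """Typed view of the object stored under the "units-v1" key."""
    distance: Optional[ImperialLength] = None
    angle: Optional[AngleUnit] = None


# "units-v1" is not a valid Python identifier, so the JSON key has to be
# mapped to an attribute-friendly name by hand (the "alias").
ALIASES = {"units-v1": "units_v1"}


def parse_attributes(attrs: dict) -> dict:
    """Rename aliased keys and convert the units object into UnitsV1."""
    parsed = {}
    for key, value in attrs.items():
        name = ALIASES.get(key, key)
        if name == "units_v1":
            parsed[name] = UnitsV1(
                # Enum lookup by value raises ValueError for unknown units.
                distance=ImperialLength(value["distance"]) if "distance" in value else None,
                angle=AngleUnit(value["angle"]) if "angle" in value else None,
            )
        else:
            parsed[name] = value
    return parsed
```

Calling `parse_attributes({"units-v1": {"distance": "ft"}})` gives a `UnitsV1` whose `distance` is `ImperialLength.FOOT`; an unknown unit string raises `ValueError`. Every hyphenated key needs its own entry in the alias table, which is the extra bookkeeping complained about here.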

clbarnes · 2024-02-03T15:08:17Z

Just dropping in having seen the ZEP page https://zarr.dev/zeps/draft/ZEP0004.html - is there any advantage to the flexibility around keeping a convention's configuration inside or not inside its own object within the attributes? I think we could stand to be more opinionated here and require that the config is kept in its own sub-object: this avoids name collisions and keeps everything together. That would also become the obvious place to keep the convention version, rather than having to encode it in the name. It also makes the jsonschema marginally easier, as you only have to describe the convention config object rather than the whole attributes object containing the convention config.

Also this way, the zarr_conventions array could become an object of {"convention_name": {"version": "2", ...}}, so that it only needs to be defined once. This would also allow it to be promoted out of the attributes entirely, although I am going back and forth on that myself as it adds yet another place to look for metadata and doesn't fit with the adjacently-tagged enum convention in the rest of zarr.

Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?
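To make the name-collision point concrete, here is a minimal sketch comparing the two layouts. The convention names `units` and `geo`, the shared key `scale`, and both helper functions are hypothetical and not part of the ZEP text.

```python
def flat_attributes(*configs: dict) -> dict:
    """Explode each convention's config directly into the attributes object.

    Raises KeyError when two conventions claim the same key.
    """
    merged = {}
    for config in configs:
        clash = merged.keys() & config.keys()
        if clash:
            raise KeyError(f"conflicting attribute names: {sorted(clash)}")
        merged.update(config)
    return merged


def namespaced_attributes(configs: dict) -> dict:
    """Keep each convention's config (and its version) in its own sub-object."""
    return {name: dict(config) for name, config in configs.items()}


# Two hypothetical conventions that both want a key called "scale".
units_config = {"version": "2", "scale": "m"}
geo_config = {"version": "1", "scale": [0.5, 0.5]}

attrs = namespaced_attributes({"units": units_config, "geo": geo_config})
```

With sub-objects, `attrs["units"]["scale"]` and `attrs["geo"]["scale"]` coexist, and each version lives next to the config it describes. The flat layout cannot hold both configs at once: `flat_attributes(units_config, geo_config)` raises `KeyError`.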

d-v-b · 2024-02-04T14:41:18Z

> I think we could stand to be more opinionated here and require that the config is kept in its own sub-object: this avoids name collisions and keeps everything together.

I'm 100% on board with this

> Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?

I'm not aware of any, but I am curious if anyone knows differently.

rabernat · 2024-02-04T19:20:06Z

> Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?

I think this sounds fine.

I would welcome explicit suggestions on the PR. I know that I have been very slow to move this forward. The space of possibilities feels vast. Specifically, @clbarnes - would you like to turn your suggestions into text on the ZEP? I would gladly incorporate that.

The same thing goes for folks who favor JSON schema. Please suggest language you would like to see in the ZEP.

yarikoptic · 2024-03-26T20:19:14Z

docs/conventions/index.rst

+      "zarr_format": 3,
+      "node_type": "group",
+      "attributes": {
+         "zarr_conventions": ["units-v1", "foo"],


Any chance to make it more "specific" but also descriptive, to potentially "decentralize" such conventions while still allowing generic validation of Zarr stores? E.g., it could become a dict of conventions here, with their versions and schema (jsonschema? or maybe linkml?) URLs, e.g.:

Suggested change

-         "zarr_conventions": ["units-v1", "foo"],
+         "zarr_conventions": {
+            "units": {
+               "version": 1,
+               "homepage": " ... URL which has potential to describe what that is about ...",
+               "schema_url": "... hosted somewhere ..."
+            },
+            "foo": {}
+         },

where, in the above, units is a well-defined convention and foo is a less well-defined one (just as an example).

Providing a schema alongside would open the opportunity for a generic Zarr validator to validate the metadata embedded in Zarr attributes against that schema. This mirrors the approach the NWB standard took: it stores a copy of its own schema and of each extension's schema within the .nwb (HDF5) file, so it becomes feasible to do generic validation, and also to open the file following those embedded schemas even if the extension library is not installed.

Separating the version from the convention name would also make things cleaner: the diff upon upgrading from one version to another becomes "to the point" (instead of changing every attribute name), making it easier to review, etc.

I am not that savvy in Zarr, and thus acknowledge that developing the schema formalization for conventions might be a larger effort than intended for this ZEP, so it might better be postponed. But establishing zarr_conventions as a collection of records instead of just a list would at least leave that possibility open without requiring breaking type changes in the future. Or maybe it is already "easy" to add basic "schema" support here?
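As a rough illustration of the "collection of records instead of just a list" idea, the sketch below converts the list form into the record form. The function name `normalize_conventions` and the assumption that versions are encoded as a trailing `-v<digits>` suffix are mine; the ZEP does not define this parsing rule.

```python
import re

# Assumed naming rule for the list form: "<name>-v<integer>", e.g. "units-v1".
_VERSIONED = re.compile(r"^(?P<name>.+)-v(?P<version>\d+)$")


def normalize_conventions(value):
    """Return zarr_conventions as a dict of records keyed by convention name.

    Accepts either the list form (["units-v1", "foo"]) or the record form
    ({"units": {"version": 1, ...}, "foo": {}}).
    """
    if isinstance(value, dict):
        # Already in record form: return a shallow copy of each record.
        return {name: dict(record) for name, record in value.items()}

    records = {}
    for identifier in value:
        match = _VERSIONED.match(identifier)
        if match:
            records[match["name"]] = {"version": int(match["version"])}
        else:
            # No version suffix: keep the name with an empty record.
            records[identifier] = {}
    return records
```

For example, `normalize_conventions(["units-v1", "foo"])` yields `{"units": {"version": 1}, "foo": {}}`. In the record form, an upgrade from version 1 to 2 changes a single value rather than the key itself, and fields such as a homepage or schema URL can be added later without changing the type.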

rabernat · 2024-04-11

I think this is a great suggestion @yarikoptic!

tasansal · 2024-04-15

@yarikoptic should we rename schema-url to schemaUrl to adhere to common JSON practice? Hyphens cause issues or require special handling when parsed in some languages.

d-v-b · 2024-04-15

What exactly is the use case for storing a schema (or a URL to a schema) for the attributes alongside the attributes? I don't see how validation is attractive in this situation, because presumably if the attributes don't pass validation against that schema, you wouldn't write them to disk in the first place. If I'm a client reading a Zarr group that implements some schema that I am aware of, then by definition I already have the schema, so including the schema in the Zarr attributes is useless here; whereas, if it's a schema I'm not aware of, then why should I care if validation against that schema succeeds or fails?

I can see why data stores that support partial reads would expose schemas, because you don't want clients to read everything just to know what's in it, but Zarr attributes are just JSON documents, so partial reading isn't really part of the picture there.

It's very likely that I don't understand the use case, so a motivating example would really help here.

yarikoptic · 2024-04-15

> I don't see how validation is attractive in this situation, because presumably if the attributes don't pass validation from that schema, you wouldn't write them to disk in the first place.

That is quite a big assumption, and it would be impossible to verify unless the schema is stored or pointed to explicitly. "Explicit is better than implicit" (Zen of Python #2). There can be a number of buggy client implementations, etc. The absence of schema formalization at the Zarr level would encourage "schema-free" conventions downstream, and thus the proliferation of unformalized conventions/extensions.

Besides validation, having a schema over the fields might open opportunities for automated construction of metadata visualization/editing UIs (e.g. using something like https://github.com/koumoul-dev/vuetify-jsonschema-form/ for vue), etc.

FWIW, having a machine-readable schema is a great feature for a standard to have: e.g. it is a foundational design principle within https://www.nwb.org/ (https://github.com/NeurodataWithoutBorders/nwb-schema), and was recently (well, years back, but it is still being formalized) established within https://bids-specification.readthedocs.io/ (src/schema), where it is already acknowledged to be of great importance.

yarikoptic · 2024-04-15

I'm not exactly sure what you propose, so it is hard to reply constructively, but it sounds like "dump everything into a dict", which would be counterproductive to the original intention of this ZEP to (citing from https://zarr.dev/zeps/draft/ZEP0004.html; emphasis is mine)

.. standardize conventions around metadata and layout of Zarr data using user-defined attributes ...

d-v-b · 2024-04-15

It is correct that I don't agree with the proposal of the ZEP, insofar as it proposes to embed schema / type information inside the thing being schematized / typed, but I'm not really advocating for "dumping everything in a dict" either.

What I advocate is very simple: Schemas and associated tooling should be used to generate and validate zarr hierarchies. E.g., defining Zarr hierarchies as typed data, and checking that instances of Zarr hierarchies pass type-checking. See pydantic-zarr for an example of this approach. This is a very simple idea: take some unstructured data, apply a type system to it, get structured data, move on. What's missing from this picture is the need to staple the type information to the data after you have type checked it, but that's essentially what this ZEP proposes, and what I disagree with.

jbms · 2024-04-15

I think there seems to be a tension between making the format more robustly machine verifiable/machine parseable (embed schema URL) and making the format more readily human readable and human writable (use only short identifier).

If attributes have to be (in practice) identified by a URL, then it becomes challenging for humans to write the format except by copy and paste.

I can see the merits of both sides, though personally I am inclined towards human readable/writable, similar to HTML and CSS, where short identifiers are used.

I would be inclined to just say that the convention is identified by the name of the attribute itself, and that there is no separate "zarr_conventions" attribute. That generally becomes easier to work with and edit, compared to having to keep both a dictionary of attributes and a list of conventions in sync. It also avoids the possibility that two conventions would assign different meanings to the same attribute, which would prevent using both conventions at the same time. If URLs are used to identify conventions, then the attribute name would itself be the URL, which would be awkward, though.

tasansal · 2024-04-16

@jbms, I like the idea of not having a separate zarr_conventions. Can you elaborate on your short identifier idea? Do you mean something like the original idea in the ZEP, i.e. units_v1, or something different? In this scenario, how would we avoid conflicts? If we make the key more elaborate, then the identifier is no longer going to be short.

Or are you thinking more of a hierarchical way to define the attributes?

i.e.

{ "units": { "length": "meter", "_convention": "< some-identifier >" } }

{ "stats": { "std": 42, "mean": 100, "_convention": "< some-identifier >" } }

jbms · 2024-04-16

Yes, I mean something like units_v1; a nested _convention seems worse than just using the convention as the property identifier itself.

However, I think we certainly do need a way to distinguish arbitrary non-standard metadata (which will presumably be used very widely) from standardized metadata properties that should be listed in some registry to ensure the identifier is unique.

I'm not sure exactly what sort of syntax makes sense --- possibly something like "std:units" or "zarr:units" or "$unit" or "@Units".

However, I think the idea behind this zarr_conventions proposal is that you may already have a collection of datasets with various metadata, and software that consumes that metadata, and therefore do not want to change the representation of the metadata at all. Instead you just want to tack on an additional property to indicate what metadata conventions are in use without having to modify the existing data or software. Potentially these "legacy" conventions could still be handled as a separate property per-convention though --- e.g. you could set "std:cf-conventions": true but then there would be additional non-prefixed properties as defined by the convention.
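A minimal sketch of how a reader could separate registered properties from arbitrary ones under such a scheme is shown below. The `std:` prefix is just one of the candidate syntaxes floated above, and `split_attributes` is a hypothetical helper, not an existing Zarr API.

```python
def split_attributes(attrs: dict, prefix: str = "std:"):
    """Split attributes into (registered, free_form) based on a key prefix.

    Keys starting with `prefix` are treated as identifiers that would be
    listed in a registry; the prefix is stripped in the returned mapping.
    Everything else is treated as arbitrary, non-standard metadata.
    """
    registered = {}
    free_form = {}
    for key, value in attrs.items():
        if key.startswith(prefix):
            registered[key[len(prefix):]] = value
        else:
            free_form[key] = value
    return registered, free_form


attrs = {
    "std:units": {"length": "meter"},
    "std:cf-conventions": True,   # flag for a "legacy" convention
    "lab_notebook_page": 12,      # arbitrary user metadata
}
registered, free_form = split_attributes(attrs)
```

Here `registered` holds `units` and `cf-conventions`, while `free_form` holds only `lab_notebook_page`. A flag such as `"std:cf-conventions": true` only announces the convention; its non-prefixed properties would still sit among the free-form keys, so a reader has to know the convention to find them.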

tasansal · 2024-04-15T13:46:39Z

I have a units convention defined in another open source project. With the current state of things, what's the best way to share this? It has a json schema with namespaces for different unit types:

(edited to be similar to the explicit convention suggestion by @yarikoptic). I really like the JSON schema idea because we can run validation against it.
https://mdio-python.readthedocs.io/en/v1/data_models/version_1.html#mdio.schemas.v1.units.MDIOUnitsV1

(Expand units dropdown if it doesn't show up via hyperlink). If you press show json schema it'll show there too. It's all pydantic and pint based.

The way we can currently specify it is like this in the variable attributes.

Within array .zattrs

"units": {"density": "g/cm**3"},

Within group (?) .zattrs

"zarr_conventions": {
 "units": {
   "version": 1, 
   "homepage": "< reorged new link to rtd for convention >",
   "schema-url": "< maybe new repo with metadata conventions in json >"
   }
}

The ZEP is unclear on some aspects. Can we meet sometime to formalize the ZEP, freeze it, and start a concrete implementation? I have many use cases for this :)

Some Qs:

Who maintains the convention? Zarr or individual domain projects, or both? ZEP0004 says it should be hosted on Zarr specs? What if we have a super domain-specific thing the overall Zarr community wouldn't care about?
Where in zarr-specs would the above go? I don't see any current placeholders.
Conventions can be set per array and per group. What is the expected behavior, e.g. which one overrides which? Maybe conventions should be allowed only at the root group level, applying to the whole file?

... and more

d-v-b · 2024-04-15T14:50:00Z

Why should conventional zarr hierarchies be responsible for expressing which conventions they adhere to? (This amounts to the question of why nominal, rather than structural, typing is the right solution here).

Also, how can this effort express conventions w.r.t the layout of arrays and groups in a hierarchy?

An alternative strategy is for Zarr hierarchy consumers to define the conventions they support, and they use the structure of Zarr hierarchies as the "signature" of those conventions. In this scenario, we would benefit from a common language for expressing a Zarr convention as a piece of data. Because the layout of a Zarr hierarchy is invariably part of the structure in-scope for a convention, we need a piece of data that can express the structure + attributes of a Zarr hierarchy. This is addressed by the zarr object models ZEP #zarr-developers/zeps#46.

So, tl;dr, I don't see why we need to define a nominal type system for Zarr attributes, when we can do structural typing on the entire hierarchy (or parts of a hierarchy).
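The structural approach can be sketched roughly as follows, with the hierarchy modelled as plain nested dicts. This is an assumption for illustration only: it is not the pydantic-zarr API or the object model proposed in the ZEP referenced above, and the `members` key and the `has_units` rule are made up for the example.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Node = Dict[str, Any]  # hypothetical in-memory description of a group or array


def walk(node: Node, path: str = "") -> Iterator[Tuple[str, Node]]:
    """Yield (path, node) for a node and all of its descendants."""
    yield path or "/", node
    for name, child in node.get("members", {}).items():
        yield from walk(child, f"{path}/{name}")


def has_units(node: Node) -> bool:
    """Structural check: 'units' is a non-empty object mapping strings to strings."""
    units = node.get("attributes", {}).get("units")
    return (
        isinstance(units, dict)
        and bool(units)
        and all(isinstance(k, str) and isinstance(v, str) for k, v in units.items())
    )


def conforming_arrays(root: Node, check: Callable[[Node], bool]) -> List[str]:
    """Paths of arrays whose structure satisfies `check`; no declaration needed."""
    return [
        path
        for path, node in walk(root)
        if node.get("node_type") == "array" and check(node)
    ]


hierarchy = {
    "node_type": "group",
    "attributes": {},
    "members": {
        "density": {"node_type": "array", "attributes": {"units": {"density": "g/cm**3"}}},
        "mask": {"node_type": "array", "attributes": {}},
    },
}
```

`conforming_arrays(hierarchy, has_units)` returns `["/density"]`. The consumer decides what counts as conforming by inspecting structure, and the data never has to name the convention it follows.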

yarikoptic · 2024-04-15T19:11:16Z

docs/conventions/index.rst

+      "node_type": "group",
+      "attributes": {
+         "zarr_conventions": ["units-v1", "foo"],
+      }


Choice of schema-language

I wanted to create a separate thread since the prior one is overloaded. If the idea of "reference/contain a schema for a convention" is generally accepted, it might be worth looking into defining it in https://linkml.io/ instead of jsonschema, since it is (1) more human readable/friendly and (2) convertible to jsonschema (or pydantic or ...; see https://linkml.io/linkml/intro/overview.html#feature-rich-modeling-language).

It might make such schemas easier to establish. I'm not yet sure whether it would be easier to use in some cases, so it might be worthwhile to accompany conventions with both linkml and jsonschema URLs... Sorry if I am adding another level of complexity right away, but I wanted to establish the "target horizon" right away ;-)

Add conventions for ZEP0004 review

ab8d4a1

Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>

rabernat mentioned this pull request Aug 18, 2023

ZEP 4: Metadata Conventions zarr-developers/zeps#28

Merged

rabernat added 2 commits February 4, 2024 15:04

added some text to the main conventions page

86acb1b

add xarray convention and reorganize

e7738bc

joshmoore mentioned this pull request Feb 23, 2024

Question: given a .zarr what is the best way to say that it is .ome.zarr / ngff? ome/ngff#228

Open

yarikoptic reviewed Mar 26, 2024

Update docs/conventions/index.rst

464ce14

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>

yarikoptic reviewed Mar 26, 2024

yarikoptic reviewed Apr 15, 2024

This was referenced Sep 11, 2024

Spec version vs zarr_format #299

Open

Related Standards bids-standard/bids-specification#401

Open

rabernat mentioned this pull request Oct 11, 2024

Lessons to learn from STAC's extensibility #316

Open
