ZEP0004 Review - Zarr Conventions #262

Draft · wants to merge 4 commits into base: main
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ docs/_build

# pycharm
.idea

.DS_Store
141 changes: 141 additions & 0 deletions docs/conventions/index.rst
@@ -0,0 +1,141 @@
===========
Conventions
===========

Why Conventions?
~~~~~~~~~~~~~~~~

Zarr Conventions provide a mechanism to standardize the metadata and layout of Zarr data
in order to meet domain-specific application needs without changes to the
core Zarr data model and specification, and without specification extensions.

Conventions must fit entirely within the Zarr data / metadata model of groups, arrays, and their attributes,
requiring no changes or extensions to the specification.
A Zarr implementation itself need not even be aware that a convention exists.
The line between a convention and an extension may be blurry in some cases.
The key distinction lies in the implementation: the responsibility for interpreting a *convention* rests entirely with downstream,
domain-specific software, while an *extension* must be handled by the Zarr implementation itself.
A good rule of thumb is that a user should be able to safely ignore the convention and still be able to interact with the data via the core Zarr library,
even if some domain-specific context or functionality is missing.
If the data are completely meaningless or unintelligible without the convention, then it should be an extension instead.

Conventions can also help users switch between different storage libraries more flexibly.
Since Zarr and HDF5 implement nearly identical data models, a single convention can be applied to both formats.
This allows downstream software to maintain better separation of concerns between storage and domain-specific logic.

Conventions are modular and composable. A single group or array can conform to multiple conventions.


Describing Conventions
~~~~~~~~~~~~~~~~~~~~~~

Conventions Document
--------------------

Conventions are described by a *convention document*.
TODO: say more about the structure and format of this document

Explicit Conventions
--------------------

The preferred way of identifying the presence of a convention in a Zarr group or array is via the attribute ``zarr_conventions``.
This attribute must be an array of strings; each string is an identifier for the convention.
Multiple conventions may be present.

For example, a group metadata JSON document with conventions present might look like this:

.. code-block:: json

   {
       "zarr_format": 3,
       "node_type": "group",
       "attributes": {
           "zarr_conventions": ["units-v1", "foo"]
@yarikoptic (Mar 26, 2024):

any chance to make it more "specific" but also descriptive to potentially "decentralize" such conventions, while still allowing for a generic validation of zarrs. E.g. it could become here a dict of conversions, with their versions and schema (jsonschema ? or may be linkml?) URLs . e.g.

Suggested change:

    "zarr_conventions": ["units-v1", "foo"],
    "zarr_conventions": {
        "units": {
            "version": 1,
            "homepage": " ... URL which has potential to describe what that is about ...",
            "schema_url": "... hosted somewhere ..."
        },
        "foo": {}
    },

where in above units is a well defined convention and foo is not so good (just for an example).

Providing schema to go along would open opportunity for a generic zarr validator to validate embedded in a zarr attributes following the schema. It is reflective of an approach NWB standard took - it stores a copy of the schema for itself of each of the extensions within .nwb (hdf5) file so it becomes feasible to do generic validation and also open it up following those embedded schemas even if extension library is not installed.

Separation of version from the convention name also would make it cleaner and diff upon upgrade from one version to another becoming "to the point" (instead of changing every attribute name) thus making it easier to review etc.

I am not that savvy in zarr and thus acknowledge that development of the schema formalization for conventions might be a larger effort than intended for this ZEP, so might better be postponed. But establishing record of zarr_conventions as a collection of records instead of just a list, would at least open such possibility without in the future requiring breaking type changes. Or may be it is already "easy" to add basic "schema" support here?

Contributor:

I think this is a great suggestion @yarikoptic!

@tasansal (Apr 15, 2024):

@yarikoptic should we rename schema-url to schemaUrl to adhere to JSON common practices? Hyphens, when parsed in some languages, cause issues / require special handling.

@d-v-b (Contributor, Apr 15, 2024):

What exactly is the use case for storing a schema (or url to a schema) for the attributes alongside the attributes? I don't see how validation is attractive in this situation, because presumably if the attributes don't pass validation from that schema, you wouldn't write them to disk in the first place. If i'm a client reading a Zarr group that implements some schema that I am aware of, then by definition I already have the schema, so including the schema in the Zarr attributes is useless here; whereas, if it's a schema I'm not aware of, then why should I care if validation of that schema succeeds or fails?

I can see why a data stores that support partial reads would expose schemas, because you don't want clients to read everything just to know what's in it, but Zarr attributes are just JSON documents, so partial reading isn't really part of the picture there.

It's very likely that I don't understand the use case, so a motivating example would really help here.

Reply:

> I don't see how validation is attractive in this situation, because presumably if the attributes don't pass validation from that schema, you wouldn't write them to disk in the first place.

That is quite a big assumption which would be impossible to verify unless schema is stored/pointed to explicitly. "Explicit better than implicit" (Zen of Python #2). There can be a number of buggy client implementations, etc. Absent formalization of schema on Zarr level would facilitate "schema-free" conventions down-stream, thus facilitate breeding unformalized conventions/extensions.

Besides validation, having a schema over the fields might open opportunities for automated metadata-visualization/editing UI constructions (e.g. using smth like https://github.com/koumoul-dev/vuetify-jsonschema-form/ for vue) etc.

FWIW, having machine readable schema is a great feature for a standard to have: e.g. a foundational design principle within https://www.nwb.org/ (https://github.com/NeurodataWithoutBorders/nwb-schema), and recently (well -- years back but still being formalized) established within https://bids-specification.readthedocs.io/ (src/schema), but already acknowledged to be of great importance.

Reply:

not exactly sure what you propose to reply constructively, but sounds like "dump everything into a dict" which would be counter-effective to the original intention of this ZEP to (citing from https://zarr.dev/zeps/draft/ZEP0004.html; emphasis is mine)

> ... standardize conventions around metadata and layout of Zarr data using user-defined attributes ...

Contributor:

It is correct that I don't agree with the proposal of the ZEP, insofar as it proposes to embed schema / type information inside the thing being schematized / typed, but I'm not really advocating for "dumping everything in a dict" either.

What I advocate is very simple: Schemas and associated tooling should be used to generate and validate zarr hierarchies. E.g., defining Zarr hierarchies as typed data, and checking that instances of Zarr hierarchies pass type-checking. See pydantic-zarr for an example of this approach. This is a very simple idea: take some unstructured data, apply a type system to it, get structured data, move on. What's missing from this picture is the need to staple the type information to the data after you have type checked it, but that's essentially what this ZEP proposes, and what I disagree with.

@jbms (Contributor, Apr 15, 2024):

I think there seems to be a tension between making the format more robustly machine verifiable/machine parseable (embed schema URL) and making the format more readily human readable and human writable (use only short identifier).

If attributes have to be (in practice) identified by a URL, then it becomes challenging for humans to write the format except by copy and paste.

I can see the merits of both sides, though personally I am inclined towards human readable/writable, similar to HTML and CSS, where short identifiers are used.

I would be inclined to just say that the convention is identified by the name of the attribute itself, and there is no separate "zarr_conventions" attribute, and it generally becomes easier to work with and edit, compared to having to keep both a dictionary of attributes and a list of conventions in sync. It also avoids the possibility that two conventions would assign different meanings to the same attribute, which would prevent using both conventions at the same time. If URLs are used to identify conventions, then the attribute name would itself be the URL, which would be awkward, though.

Reply:

@jbms, I like the idea of not having a separate zarr_conventions. Can you elaborate on your short identifier idea? Do you mean like the original idea in the ZEP; i.e. units_v1 or something different? In this scenario, how would we avoid conflicts? If we make the key more elaborate, then the identifier is going to be not short.

Or are you thinking more of a hierarchical way to define the attributes?

i.e.

{
  "units":
  {
    "length": "meter",
    "_convention": "< some-identifier >"
  }
}
{
  "stats":
  {
    "std": 42,
    "mean": 100,
    "_convention": "< some-identifier >"
  }
}

Contributor:

Yes, I mean something like units_v1 --- a nested _convention seems worse than just using the convention as the property identifier itself.

However, I think we certainly do need a way to distinguish arbitrary non-standard metadata (which will presumably be used very widely) from standardized metadata properties that should be listed in some registry to ensure the identifier is unique.

I'm not sure exactly what sort of syntax makes sense --- possibly something like "std:units" or "zarr:units" or "$unit" or "@Units".

However, I think the idea behind this zarr_conventions proposal is that you may already have a collection of datasets with various metadata, and software that consumes that metadata, and therefore do not want to change the representation of the metadata at all. Instead you just want to tack on an additional property to indicate what metadata conventions are in use without having to modify the existing data or software. Potentially these "legacy" conventions could still be handled as a separate property per-convention though --- e.g. you could set "std:cf-conventions": true but then there would be additional non-prefixed properties as defined by the convention.

       }


Choice of schema-language

wanted to create a separate thread since the prior one is overloaded. If the idea of "reference/contain a schema for a convention" would generally be accepted, it might be worth looking into defining it in https://linkml.io/ instead of jsonschema since 1. it is more human readable/friendly; 2. it can be converted to jsonschema (or pydantic or ... see https://linkml.io/linkml/intro/overview.html#feature-rich-modeling-language )

Might be easier to establish such schemas. Not yet sure if would be easier to use in some cases, so might be worthwhile accompanying with both linkml and jsonschema urls... sorry if I am adding another level of complexity right away - but wanted to establish the "target horizon" right away ;-)

   }

where ``units-v1`` and ``foo`` are the convention identifiers.
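Because ``zarr_conventions`` lives in ordinary attributes, downstream software can discover declared conventions with plain JSON handling; no Zarr library is required. A minimal sketch (the helper name ``declared_conventions`` is ours, not part of any specification):

```python
import json

def declared_conventions(group_metadata: dict) -> list:
    """Return the convention identifiers declared in a v3 group document."""
    conventions = group_metadata.get("attributes", {}).get("zarr_conventions", [])
    if not (isinstance(conventions, list)
            and all(isinstance(c, str) for c in conventions)):
        raise ValueError("zarr_conventions must be an array of strings")
    return conventions

doc = json.loads("""
{
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {"zarr_conventions": ["units-v1", "foo"]}
}
""")
print(declared_conventions(doc))  # prints ['units-v1', 'foo']
```

A document with no ``zarr_conventions`` attribute simply yields an empty list, matching the rule that conventions are safely ignorable.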


Legacy Conventions
------------------

A legacy convention is a convention already in use that predates this ZEP.
Data conforming to a legacy convention will not have the ``zarr_conventions`` attribute.
The convention document must therefore specify how software can identify the presence of the convention through a series of rules or tests.

In conformance-testing terminology, a legacy convention can be thought of as defining a "conformance class" with a corresponding "conformance test".
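A conformance test for a legacy convention amounts to a predicate over a node's attributes. A sketch of how software might run such tests (the registry and its keys are illustrative, not a normative list; the xarray rule is taken from the Xarray convention described later in this PR):

```python
# Each entry maps a convention identifier to a predicate over an
# attributes dict. Identifiers here are hypothetical examples.
CONFORMANCE_TESTS = {
    # Xarray's legacy convention is signalled by _ARRAY_DIMENSIONS
    "xarray": lambda attrs: isinstance(attrs.get("_ARRAY_DIMENSIONS"), list),
}

def detect_legacy_conventions(attrs: dict) -> list:
    """Return the identifiers of all legacy conventions whose test passes."""
    return [name for name, test in CONFORMANCE_TESTS.items() if test(attrs)]

print(detect_legacy_conventions({"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]}))
# prints ['xarray']
```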

Namespacing
-----------

Conventions may choose to store their attributes in a specific namespace.
This ZEP does not specify how namespacing works; that is left to the convention.
For example, the namespace may be specified as a prefix on attribute names, e.g.

.. code-block:: json

   {
       "attributes": {"units-v1:units": "m^2"}
   }


or via a nested JSON object, e.g.

.. code-block:: json

   {
       "attributes": {"units-v1": {"units": "m^2"}}
   }

The use of namespacing is optional; each convention decides whether and how to apply it.
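Since both namespacing styles carry the same information, a reader can normalize them. A sketch (the helper name ``convention_attrs`` is hypothetical):

```python
def convention_attrs(attributes: dict, namespace: str) -> dict:
    """Collect a convention's attributes under either namespacing style.

    Handles the nested-object style ({"ns": {...}}) and the prefix
    style ("ns:key"). Returns an empty dict when neither is present.
    """
    nested = attributes.get(namespace)
    if isinstance(nested, dict):
        return dict(nested)
    prefix = namespace + ":"
    return {k[len(prefix):]: v for k, v in attributes.items()
            if k.startswith(prefix)}

print(convention_attrs({"units-v1:units": "m^2"}, "units-v1"))       # prefix style
print(convention_attrs({"units-v1": {"units": "m^2"}}, "units-v1"))  # nested style
```

Both calls yield ``{'units': 'm^2'}``, illustrating that the choice of style need not leak into downstream logic.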


Proposing Conventions
~~~~~~~~~~~~~~~~~~~~~

New conventions are proposed via a pull request to the ``zarr-specs`` repo which adds a new convention document.
If the convention is already documented elsewhere, the convention document can simply reference the external documentation.
The author of the PR is expected to convene the relevant domain community to review and discuss the proposal.
This includes posting a link to the PR on relevant forums, mailing lists, and social-media platforms.

The goal of the discussion is to reach a *consensus* among the domain community regarding the convention.
The Zarr steering council, together with the PR author, will determine if a consensus has been reached, at which point the PR
can be merged and the convention published on the website.
If a consensus cannot be reached, the steering council may still decide to publish the convention, accompanied by a
disclaimer that it is not a consensus, and noting any objections that were raised during the discussion.

It is also possible that multiple, competing conventions exist in the same domain. While not ideal, it's not up to
the Zarr community to resolve such domain-specific debates.
These conventions should still be documented in a central location, which hopefully helps move towards alignment.

Conventions should be versioned using incremental integers, starting from 1.
Alternatively, if a community already has an existing versioning system for its convention (e.g. the CF Conventions), that can be used instead.
The community is free to update their convention via a pull request using the same consensus process described above.
The conventions document should include a changelog.
Details of how to manage changes and backwards compatibility are left to the domain community.


Existing Conventions
~~~~~~~~~~~~~~~~~~~~


This page lists known Zarr conventions. The proposal to formalize conventions is introduced in `ZEP0004 <https://zarr.dev/zeps/draft/ZEP0004.html>`_.

Some of the widely used conventions are:

- `GDAL <https://gdal.org/drivers/raster/zarr.html>`_
- `OME-NGFF <https://ngff.openmicroscopy.org/>`_
- `NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_
- `Xarray <https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html>`_

Any new conventions accepted by the `ZEP <https://zarr.dev/zeps/active/ZEP0000.html>`_ process will be listed here.

.. toctree::
   :glob:
   :maxdepth: 1
   :titlesonly:
   :caption: Contents:

   xarray

99 changes: 99 additions & 0 deletions docs/conventions/xarray.rst
@@ -0,0 +1,99 @@
======================
Xarray Zarr Convention
======================

+---------------------+----------------------+
| Convention Type | Legacy |
+---------------------+----------------------+
| Zarr Spec Versions | V2 |
+---------------------+----------------------+
| Status | Active |
+---------------------+----------------------+
| Active Dates | 2018 - present |
+---------------------+----------------------+
| Version | 1 |
+---------------------+----------------------+

See also `Zarr Encoding Specification <https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html>`_
in the Xarray docs.


Description
-----------

`Xarray`_ is a Python library for working with labeled multi-dimensional arrays.
Xarray was originally designed to read only `NetCDF`_ files, but has since added support for
other formats.
In implementing support for the `Zarr <https://zarr.dev>`_ storage format, Xarray developers
made some *ad hoc* choices about how to store NetCDF-style data in Zarr.
These choices have become a de facto convention for mapping the Zarr data model to the
`NetCDF data model <https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html>`_.

First, Xarray can only read and write Zarr groups. There is currently no support
for reading / writing individual Zarr arrays. Zarr groups are mapped to
Xarray ``Dataset`` objects, which correspond to NetCDF-4 / HDF5 groups.

Second, from Xarray's point of view, the key difference between
NetCDF and Zarr is that all NetCDF arrays have *dimension names* while Zarr
arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must
somehow encode and decode the name of each array's dimensions.

To accomplish this, Xarray developers decided to define a special Zarr array
attribute: ``_ARRAY_DIMENSIONS``. The value of this attribute is a list of
dimension names (strings), for example ``["time", "lon", "lat"]``. When writing
data to Zarr, Xarray sets this attribute on all variables based on the variable
dimensions. When reading a Zarr group, Xarray looks for this attribute on all
arrays, raising an error if it can't be found. The attribute is used to define
the variable dimension names and then removed from the attributes dictionary
returned to the user.
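The decode step described above can be sketched in a few lines of plain Python (a simplified illustration of the behaviour, not Xarray's actual implementation):

```python
def decode_xarray_dims(attrs: dict):
    """Pop _ARRAY_DIMENSIONS from a copy of the attrs and return (dims, rest).

    Mirrors the described behaviour: the attribute defines the variable's
    dimension names and is removed from the attributes returned to the user;
    its absence is an error.
    """
    rest = dict(attrs)
    try:
        dims = rest.pop("_ARRAY_DIMENSIONS")
    except KeyError:
        raise KeyError("array lacks _ARRAY_DIMENSIONS; not readable as Xarray data")
    return dims, rest

dims, rest = decode_xarray_dims({"_ARRAY_DIMENSIONS": ["time", "lat"], "units": "K"})
print(dims, rest)  # prints ['time', 'lat'] {'units': 'K'}
```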

Because of these choices, Xarray cannot read arbitrary array data, but only
Zarr data with valid ``_ARRAY_DIMENSIONS`` attributes on each array.

After decoding the ``_ARRAY_DIMENSIONS`` attribute and assigning the variable
dimensions, Xarray proceeds to (optionally) decode each variable using the
standard `CF Conventions`_ decoding machinery it uses for NetCDF data.

Finally, it's worth noting that Xarray writes (and attempts to read)
"consolidated metadata" by default (the ``.zmetadata`` file), which is another
non-standard Zarr extension, albeit one implemented upstream in Zarr-Python.

.. _Xarray: http://xarray.dev
.. _NetCDF: https://www.unidata.ucar.edu/software/netcdf
.. _CF Conventions: http://cfconventions.org


Identifying the Presence of this Convention
-------------------------------------------

In implementing this convention, Xarray developers made the unfortunate choice of not
including any explicit identifier in the Zarr metadata. The only way to determine
whether the convention is in use is therefore to examine the contents of the
Zarr dataset and look for the following properties:

* A single flat group containing one or more arrays
* The presence of the ``_ARRAY_DIMENSIONS`` attribute on each array, whose contents are
a list of dimension names (strings)
* If a dimension name corresponds to another array name within the group, that array is
  assumed to be a dimension coordinate. Dimension coordinate arrays must be 1D
  and have the same length as the corresponding dimension.
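The identification rules above can be sketched as a conformance test over an in-memory description of the group. The mapping of array name to a dict with ``"shape"`` and ``"attributes"`` is an assumption of the sketch, not a real API:

```python
def is_xarray_convention_group(arrays: dict) -> bool:
    """Check the Xarray-convention identification rules on a flat group.

    `arrays` maps array name -> {"shape": [...], "attributes": {...}}.
    """
    for name, arr in arrays.items():
        dims = arr.get("attributes", {}).get("_ARRAY_DIMENSIONS")
        # every array must carry a list of dimension-name strings
        if not (isinstance(dims, list) and all(isinstance(d, str) for d in dims)):
            return False
        for axis, dim in enumerate(dims):
            coord = arrays.get(dim)
            if coord is None:
                continue  # no dimension coordinate for this dimension
            # dimension coordinates must be 1-D with matching length
            if len(coord["shape"]) != 1 or coord["shape"][0] != arr["shape"][axis]:
                return False
    return True

group = {
    "time": {"shape": [3], "attributes": {"_ARRAY_DIMENSIONS": ["time"]}},
    "temp": {"shape": [3, 4], "attributes": {"_ARRAY_DIMENSIONS": ["time", "lon"]}},
}
print(is_xarray_convention_group(group))  # prints True
```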


CF Conventions
--------------

It is common for data stored in Zarr using the Xarray convention to also follow
the `CF Conventions`_ (Climate and Forecast Metadata Conventions).

A high-level description of these conventions, quoted from the CF documentation, is as follows:

    The NetCDF library [NetCDF] is designed to read and write data that has been structured
    according to well-defined rules and is easily ported across various computer platforms.
    The netCDF interface enables but does not require the creation of self-describing datasets.
    The purpose of the CF conventions is to require conforming datasets to contain sufficient
    metadata that they are self-describing in the sense that each variable in the file has an
    associated description of what it represents, including physical units if appropriate,
    and that each value can be located in space (relative to earth-based coordinates) and time.

The CF Conventions are massive and cover a wide range of topics. Readers should consult the
`CF Conventions`_ documentation for more information.
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -7,7 +7,7 @@ A good starting point is the :ref:`zarr-core-specification-v3.0`.
.. toctree::

Home <https://zarr.dev>
specs
conventions
ZEPs <https://zarr.dev/zeps>
Implementations <https://github.com/zarr-developers/zarr_implementations>

6 changes: 6 additions & 0 deletions docs/specs.rst
@@ -12,6 +12,12 @@ Specifications
v3/stores
v3/array-storage-transformers

.. toctree::
   :maxdepth: 1
   :caption: Conventions

   Conventions <conventions/index.rst>

.. toctree::
:maxdepth: 1
:caption: v2