[draft] zarr object models #46
For zarr v3, the user-defined attributes are stored within the main metadata file under the |
@jbms That's a great point; I'm happy to expand |
I am not sure I got the motivation for this. I understand the motivation of consolidated metadata. Maybe this ZEP should morph into that? Then, it should be an extension to the zarr.json for persistence imo. Also, I would focus the ZEP on v3.
Agree with @normanrz, a section for motivation and usage would be helpful.
I totally agree that the motivation needs to be expanded. At the moment, all we say is
@normanrz do you agree that these are valid motivations, and worthy of expanding on, or should we provide additional motivations?
The basic ZOM applies to v3 and v2 equally, and I think it's important to emphasize this, because the ZOM representation would be useful for converting from v2 hierarchies to v3 (and from v3 to v4, if v4 ever exists). Would it help if I made this point clearly in the ZEP?
To help with the motivation, I think this point
could be emphasized further, perhaps with an example of how it could eventually work. It seems like this is a kind of "extended consolidated metadata" and could be framed more in that way. Beyond the "base consolidated metadata", my understanding is that this would also include the contents of the .zattrs/.zarray/.zgroup files (and the v3 equivalent) which would be used to implement the support for typing / validation / comparison.
Perhaps it would be simpler to use the metadata file names directly in the flattened representation rather than abstracting / trying to unify across versions. Using

It could be useful to define the name of an optional property that could be used to specify the URL of a JSON schema to use for validation of ZOM-structured stores such as OME-NGFF or AnnData. For example, Vega-Lite uses a

Perhaps outside the scope of this proposal, but related, it might be useful for Zarr to make a distinction between ZOM-structured stores vs. non-ZOM-structured stores in the name of the root file/directory (as a convention, not a requirement). For example, as a human, if I look in my file explorer and see
Thanks for getting this started @d-v-b! I think this will really aid in the maintenance and development of Zarr implementations, old and new.
## Implementation

- pydantic zarr
- ?
- dataclass zarr
(I have an unpublished version that I can share soon)
zarrita also has attrs classes that define the metadata (minus the new `members` properties): https://github.com/scalableminds/zarrita/blob/async/zarrita/metadata.py#L259
- The origins of consolidated metadata:
  * <https://github.com/pangeo-data/pangeo/issues/309>
  * <https://github.com/zarr-developers/zarr-python/pull/268>
It may also be worth summarizing some of the intended benefits to existing/internal applications. For example, using a standard data object internally within zarr-python may improve the workflow for creating large hierarchies, by allowing users to create the ZOM metadata before passing it to a zarr.creation method.
draft/ZEP0006.md
Outdated
And Zarr V3:

```json
# insert schema for v3 here
```
let me know if you could use some help generating this.
some help here would be great, thank you!
Just a clarification about the timeline: I'm working on a transatlantic move, so I can't promise a huge investment in this until mid-October. Thanks for your patience!
…xamples, change attrs to attributes
as per feedback, the field used for user metadata has been renamed from |
draft/ZEP0006.md
Outdated
- `members`: a key-value data structure where the keys are strings and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them.

If future versions of Zarr use a property called `members` for some element of Zarr group metadata, then there would be a naming collision between the `members` property of a Zarr group and the `members` property of a ZOM group. In this case, the ZOM group would rename the Zarr group's `members` property to `_members`, and any additional name collisions would be resolved by prepending additional underscore ("_") characters. E.g., in the unlikely case that `members` and `_members` are *both* listed in Zarr group metadata, then the schema group representation would map the `members` property of the Zarr group to a property called `__members`.
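
The collision rule described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: group metadata is modeled as a plain dict, and the helper names `escaped_members_key` and `to_zom_group` are hypothetical, not part of the ZEP or of any library.

```python
def escaped_members_key(metadata_keys):
    """Return the ZOM property name for a Zarr-level `members` key.

    Hypothetical helper: prepend underscores until the name no longer
    collides with a key already present in the Zarr group metadata.
    """
    candidate = "_members"
    while candidate in metadata_keys:
        candidate = "_" + candidate
    return candidate


def to_zom_group(group_metadata, members):
    """Build a ZOM group (a plain dict here) from Zarr group metadata."""
    zom = dict(group_metadata)
    if "members" in zom:
        # Pick the escaped name while the original keys are still present.
        new_key = escaped_members_key(zom.keys())
        zom[new_key] = zom.pop("members")
    # The ZOM-level `members` property holds the child arrays and groups.
    zom["members"] = members
    return zom
```

With this sketch, metadata that contains both `members` and `_members` keeps `_members` untouched and moves the Zarr-level `members` value to `__members`, matching the example in the text.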
I want to reiterate the idea of broadening this ZEP to include persisted consolidated metadata. Basically, why not allow storing the `members` property in a zarr.json?
We would need to define the semantics of consolidated metadata (e.g. do member nodes still need JSON files, does the `members` hierarchy need to be exhaustive). I would be happy to contribute that if there is interest in moving this ZEP in that direction.
There's definitely interest in that, my apologies for not making this more clear earlier. I'm not a user of consolidated metadata so I don't have a lot of experience with it, but I think for this ZEP to encompass consolidated metadata functionality as it exists today (i.e., a flat list of string keys pointing to JSON objects) we would need to define a tree flattening operation, and possibly make `members` nullable (because in a flattened representation a ZOM group shouldn't hold a reference to its contents). Alternatively, if the flattened representation of the hierarchy used in consolidated metadata isn't essential to its function, we could simply put a ZOM in JSON and leave it to clients to do the flattening. I don't have strong feelings either way! You should absolutely feel free to contribute something here.
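
One way the tree flattening operation mentioned above could look, as a minimal Python sketch. It assumes a ZOM held as nested plain dicts with a v3-style `node_type` field, uses "/"-joined paths as keys with the empty string for the root, and sets `members` to `None` on flattened groups; the function name `flatten` and all of these conventions are illustrative assumptions, not something the ZEP defines.

```python
def flatten(node, path=""):
    """Flatten a nested ZOM (plain dicts) into a dict of path -> node.

    Assumption: in the flat form a group carries `members = None`, because
    it should not hold a reference to its contents.
    """
    flat = {}
    if node.get("node_type") == "group":
        members = node.get("members") or {}
        # Copy the group's own metadata and drop the nested children.
        flat[path] = {**node, "members": None}
        for name, child in members.items():
            child_path = f"{path}/{name}" if path else name
            flat.update(flatten(child, child_path))
    else:
        # Arrays have no children, so they are copied as-is.
        flat[path] = dict(node)
    return flat
```

The result is a flat mapping of string keys to JSON-like objects, which is the shape described for consolidated metadata in the comment above.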
I think we should keep consolidated metadata separate from this ZOM concept. Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.
We intend to propose a ZEP which uses a STAC style link-relation element to allow a client to traverse an entire hierarchy without being able to list a store. This is similar to consolidated metadata but more scalable because it does not require all the metadata to be in a single json file. For context, we have hierarchies with 100_000 nodes. Would be happy to collaborate and iterate with you on that.
> Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

In this case, maybe it would be good to get a statement like this in this ZEP to clarify the relationship between the abstract ZOM and consolidated metadata.
I am now wondering if this members property needs to be tweaked to accommodate the case of very large hierarchies. As is, it seems like the entire hierarchy may have to be explicitly populated all at once.
In Python terms, I'd like to allow `members` to be either a set of child objects or a generator that yields such objects lazily. Is this making it too complicated?
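
To make the eager-or-lazy idea concrete, here is a hedged Python sketch. The classes `ArrayNode` and `GroupNode` and the method `iter_members` are hypothetical names invented for illustration. Note one deliberate deviation: instead of storing a generator directly (which can only be consumed once), the lazy case stores a zero-argument callable that returns a fresh iterator each time.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, Tuple, Union


@dataclass
class ArrayNode:
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GroupNode:
    attributes: Dict[str, Any] = field(default_factory=dict)
    # Either an eager dict of children, or a callable that produces
    # (name, child) pairs on demand.
    members: Union[
        Dict[str, "Node"],
        Callable[[], Iterator[Tuple[str, "Node"]]],
    ] = field(default_factory=dict)

    def iter_members(self):
        """Yield (name, child) pairs for both the eager and the lazy case."""
        if callable(self.members):
            yield from self.members()
        else:
            yield from self.members.items()


Node = Union[ArrayNode, GroupNode]
```

A caller that only needs the first few children of a very large group can stop iterating early, and nothing beyond that point has to be materialized.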
> (i.e, a flat list of string keys pointing to JSON objects)

It wouldn't have to have this structure. It could be a nested json structure like this:

    {
      "zarr_format": 3,
      "node_type": "group",
      "attributes": {},
      "members": {
        "some_group": {
          "zarr_format": 3,
          "node_type": "group",
          "attributes": {},
          "members": {
            "some_array": {
              "zarr_format": 3,
              "node_type": "array",
              ...
            }
          }
        }
      }
    }

> I think we should keep consolidated metadata separate from this ZOM concept. Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

I think there is strong overlap between the ZOM and consolidated metadata. This ZEP introduces a JSON schema that describes the existing metadata of groups and arrays, with the new addition of the `members` property. I think it would be very confusing if consolidated metadata ended up with different terminology than the ZOM.

> We intend to propose a ZEP which uses a STAC style link-relation element to allow a client to traverse an entire hierarchy without being able to list a store. This is similar to consolidated metadata but more scalable because it does not require all the metadata to be in a single json file. For context, we have hierarchies with 100_000 nodes. Would be happy to collaborate and iterate with you on that.

That sounds great. As I said, we can discuss the semantics and features of the consolidated metadata. That could include linking. I don't think we should limit ourselves by what the implementation in zarr-python currently has.
> I am now wondering if this members property needs to be tweaked to accommodate the case of very large hierarchies. As is, it seems like the entire hierarchy may have to be explicitly populated all at once.

This concern is real! See also janelia-cellmap/pydantic-zarr#2. The proposal there was to make `members` nullable, where `None` would encode "The members have not been parsed", and to give a tree parser the option to limit the depth of traversal, which would result in "truncated" `GroupSpec` instances being valid. But maybe the Python generator approach obviates the need to express this with nullability? I'm open to suggestions here.
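
A depth-limited traversal with nullable `members` could be sketched as below. This is not the pydantic-zarr API: it works on nested plain dicts with a v3-style `node_type` field, and the function name `truncate` and its `max_depth` parameter are hypothetical. `None` encodes "members not parsed", while an empty dict still means "no members".

```python
def truncate(node, max_depth):
    """Return a copy of a nested ZOM (plain dicts) cut off at `max_depth`.

    Groups reached at depth `max_depth` get `members = None`.
    """
    if node.get("node_type") != "group":
        # Arrays are leaves and are copied unchanged.
        return dict(node)
    members = node.get("members")
    if max_depth == 0 or members is None:
        # Either the depth budget is used up or the input was already truncated.
        return {**node, "members": None}
    return {
        **node,
        "members": {
            name: truncate(child, max_depth - 1)
            for name, child in members.items()
        },
    }
```

Because the output keeps `{}` and `None` distinct, a consumer can tell a group that is known to be empty from one whose contents were simply not visited.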
It would be nice to be able to distinguish between "there are definitely no members" vs. "there may be members, but they have to be discovered explicitly"
In pydantic-zarr, `members` is now nullable, and it's been extremely useful. That being said, this can be viewed as a well-defined transformation of the base type, so it's not clear if the ZEP actually needs to address it.
Hi @d-v-b. I've fixed the RTD build issue in #51. The current PR can be viewed here: https://zeps--46.org.readthedocs.build/en/46/draft/ZEP0006.html
I threw in a JSON schema for ZOM[v3], a ZOM[v3] example hierarchy serialized to JSON, and some Python / TypeScript static typing examples. I am wondering if it would make more sense to push these hierarchy definitions into the v2 and v3 specs as addenda? This ZEP could exist for posterity, but this would be an easier way to formally associate a particular ZOM with a specific Zarr version. Since the change would be purely additive, it seems safe to do retrospectively (in the case of Zarr v2). Thoughts?
What should the value of Since the If we're willing not to require the I haven't formed a preference.
@bogovicj so far I've been thinking about exclusively using By contrast, I think there's an argument for making the That being said, if the
This ZEP defines a representation of a Zarr hierarchy, called a Zarr Object Model (ZOM). The purpose of this ZEP is to standardize an abstract representation of Zarr hierarchies to support declarative Zarr APIs, and to give type systems access to the structure of Zarr hierarchies. A side effect of this ZEP is a standardization of consolidated metadata, which can be defined as a flattening transformation applied to a ZOM representation of a Zarr hierarchy.
I didn't use the template structure for this ZEP because it felt limiting, but if that's a big problem I can bring more of that structure back in.
In terms of what needs to be done:
cc @jhamman