zarr validation and consistency checking #912
I've found the answer to the first question already: an invalid zarr will not always raise an error immediately upon opening. For example, if a chunk file is malformed, this won't be detected until you actually try to use the containing array.
Welcome and thanks for the questions.
This is definitely deliberate behavior. Zarr arrays can be petabytes in size with millions of chunks! Individually checking each chunk on opening would not be the right default behavior. Missing chunks are valid in Zarr as well--they represent missing data.
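The point about missing chunks being valid can be sketched with the standard library alone: given an array's shape and chunk sizes (as they would appear in a Zarr v2 `.zarray` document), you can enumerate every chunk key the store *could* contain, while treating absent keys as missing data rather than errors. The function name and example shapes below are my own, not zarr-python API.

```python
import itertools
import math

def expected_chunk_keys(shape, chunks, sep="."):
    """Enumerate every chunk key a Zarr v2 array could have.

    Chunks absent from the store are not errors: Zarr treats them
    as filled with the array's fill_value.
    """
    counts = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    for idx in itertools.product(*(range(n) for n in counts)):
        yield sep.join(str(i) for i in idx)

# hypothetical 100x100 array with 64x64 chunks -> 2x2 = 4 possible chunks
keys = list(expected_chunk_keys((100, 100), (64, 64)))
print(keys)  # ['0.0', '0.1', '1.0', '1.1']
```

A validator built on this would flag chunk keys that are *unexpected* (outside the grid) while leaving absent keys alone.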
Depends on what you mean by "valid"? If you are asking if the store can be opened by Zarr, then yes, this is sufficient. If you are asking whether your data have been corrupted, then no. You may consider using array.hexdigest to verify data integrity.
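To illustrate the idea behind a store-wide integrity check (this is *not* Zarr's actual `hexdigest` algorithm, just the general technique, with a name I made up): hash every file in the store in a deterministic order, so any change to metadata or chunk bytes changes the digest.

```python
import hashlib
import os

def store_digest(root):
    """Hash every file in a directory store, in sorted key order.

    Any change to metadata or chunk bytes changes the digest.
    A sketch of the concept, not Zarr's own hexdigest algorithm.
    """
    h = hashlib.sha256()
    for dirpath, _, files in sorted(os.walk(root)):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h.update(rel.encode())  # include the key, not just the bytes
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()
```

As noted above, for a petabyte-scale store this full scan is expensive, which motivates the tree-hash schemes discussed below in the thread.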
Zarr will just ignore those files. I don't think they'll break anything.
See comments above about hexdigest. There are also ongoing discussions in #877.
@rabernat - thank you. ah, hexdigest would apply as an overall checksum. we can compute it, but hexdigest could potentially be a very expensive operation. good to know it exists.

we are (at least for our backend on s3) working on a tree-hash scheme to store checksums associated with every file and "directory" in the tree. if zarr ignores any irrelevant files, we may even consider computing and storing the checksums locally or in some zipped checksum store (to prevent inode explosion), and if this works we may propose a tree-hash scheme for diff detection. if you already have any conversations on diff detection, would love to know.

sharding support would be fantastic and would really help optimize the nested directory structure to minimize the number of files. i'm hoping this won't break any xarray-style access when it's implemented and would be transparent to any end user. given the datasets we are handling, the current recommended chunk size is 64**3 and that's resulting in about a million files per zarr store.
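The tree-hash scheme described above can be sketched in a few lines: a file hashes its bytes, and a directory hashes the sorted (name, child-hash) pairs of its entries. Equal subtrees then get equal hashes, so diffing two stores reduces to comparing hashes top-down and descending only where they differ. This is a generic Merkle-style sketch, not the specific algorithm satra's team implemented.

```python
import hashlib
import os

def tree_hash(path):
    """Merkle-style tree hash over a directory tree.

    A file hashes its bytes; a directory hashes the sorted
    (name, child-hash) pairs of its entries. A generic sketch,
    not any particular project's checksum scheme.
    """
    h = hashlib.sha256()
    if os.path.isdir(path):
        for name in sorted(os.listdir(path)):
            child = tree_hash(os.path.join(path, name))
            # separate name and hash so boundaries are unambiguous
            h.update(name.encode() + b"\0" + child.encode())
    else:
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```

Because each directory's hash depends only on its contents, an unchanged subtree can be skipped entirely during diff detection.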
In that case you may be interested in the conversation in #392 (comment) and zarr-developers/zarr-specs#82. IPFS solves this problem very elegantly, and a lot of us are interested in plugging Zarr into IPFS.
i love ipfs (at least the concept), but the efficiency is not quite there yet for practical use. yes, ipfs would solve several of these things. we have a bottleneck in that ipfs would require a client running in front of it, and since we are using a public dataset program, we have some constraints in terms of how to support it. we are indeed considering ipfs (or its variants) as part of an institutional infrastructure across universities. i'll check in on those conversations.
Hi @satra, A few quick answers while we see if anyone else in the community has built anything.
In terms of the metadata, I'd believe so. zarr-python tends to be fairly lenient about the chunks until access (and missing chunks are considered legitimate).
The files that are relevant to the store are quite limited. Anything other than the Zarr metadata and chunk files is generally ignored.
Not that I know of. See also #392. Edit: interesting! I didn't see any of the previous responses when I was responding...
It looks like this is now being addressed by zarr_checksum. Is that right @satra?
@jakirkham - indeed, that's a tree-hash algorithm we implemented for our needs, and we are using that digest for files in dandi. it's a pure object-based hash with no semantics. we may in the future also want to consider an isomorphic hash, where the bits can change but the content is the same (e.g. moving from uint8 to uint16). also, given the sizes of file trees, we may want to consider ways to optimize both hash checking and diff detection. i'll close this for now. i had completely forgotten about this issue, so thank you @jakirkham
@satra you might be interested in pydantic-zarr. It's designed to normatively represent zarr hierarchies. I think some of the things you are looking for could be built with this library, and it's very small (right now), so you could just implement the same functionality in your own tooling very easily without adding it as a dependency.
thanks @d-v-b, looks nice and would be easy to incorporate since we already have a pydantic-based setup for our schema. a possibility that we are experimenting with in a few projects is to use linkml, which abstracts the metadata model into a yaml definition and then uses generators to create various toolkits (among them pydantic). there are many little issues at this point, but they have effectively collapsed a lot of the patterns we use across projects into a single markup language + generators.
is there anything specific you'd need from zarr-python to make this easier? something on my wishlist is a specification for a JSON-serializable representation of a zarr hierarchy, which would make this kind of tooling much easier to build.
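One shape such a JSON-serializable hierarchy representation could take (purely illustrative, not a spec that exists): a flat mapping from node paths to their metadata documents. Because it's plain JSON, it can be validated, diffed, or version-controlled without touching any chunk data. The `temperature` array and its attributes below are hypothetical.

```python
import json

# a hypothetical flat JSON form of a zarr v2 hierarchy: path -> metadata
hierarchy = {
    "": {"zarr_format": 2},                 # root group
    "temperature": {                        # an array node
        "zarr_format": 2,
        "shape": [100, 100],
        "chunks": [64, 64],
        "dtype": "<f8",
        "fill_value": 0.0,
    },
}

# round-trips losslessly through JSON, so structural validation and
# diff detection can operate on the serialized form alone
assert json.loads(json.dumps(hierarchy)) == hierarchy
```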
@d-v-b - sorry for the very late response. indeed linkml's data model would allow that and i know some of the linkml folks are in conversation with the NWB folks regarding array data type in linkml as well. here is an intro talk covering basics of linkml: https://zenodo.org/record/7778641 i think it would be a good opportunity to turn the zarr spec into a data model that may fit in with many different worlds of use cases. |
This is indeed actively being worked upon within the LinkML team at the moment. Just tagging @rly who is currently involved in this. |
i could not find any explicit reference in the documentation to validating a zarr store, hence opening this issue.
we are supporting zarr nested directory stores as a file type for our data archive and looking to validate and inspect the structure of the input before upload. some questions have come up that i am posting here: