feature(stores): draft zip file store specification #311
* Delete a file.

* Delete a directory.
See #103
In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right, e.g., |
How useful is a ZipStore in practice? Are there a lot of use cases for it? Given how limited it is (no rename/deletion, etc.), I am wondering if it's worth having a spec for it.
I have support equivalent to zipstore in nczarr in the netcdf-c library. I agree that it does not appear to be |
@joshmoore - do you have suggestions for the spec document that would make this clearer? @zoj613 and @DennisHeimbigner - let's try to avoid making this about alternatives to the ZIP store concept. There are practical reasons to add this (Zarr-Python has long supported a ZIP store interface). Remember, Zarr can support many storage backends. If there are alternatives to experiment with, let's do that in a separate issue. @DennisHeimbigner - I would like to get your feedback on the spec as written. Is it aligned with your netcdf-c implementation? |
* ``get(key) -> value`` : Read and return the contents of the object at
  within the archive at path ``key``.

* ``set(key, value)`` : Write ``value`` as the contents of the file at
  into the archive at path ``key``.
Is the use of "at within" and "at into" in these lines intentional? Sounds like a typo.
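For readers following the thread, here is a minimal sketch of what the quoted `get` and `set` operations could look like on top of Python's standard `zipfile` module. The class name `MiniZipStore` and its refusal to overwrite an existing key are assumptions made for illustration; this is not the zarr-python `ZipStore` and not something the draft mandates.

```python
import zipfile


class MiniZipStore:
    """Hypothetical minimal store mapping keys to members of one zip archive."""

    def __init__(self, path: str) -> None:
        # Mode "a" creates the archive if it is missing and appends otherwise.
        # ZIP_STORED (no zip-level compression) is an assumption of this sketch.
        self._zf = zipfile.ZipFile(
            path, mode="a", compression=zipfile.ZIP_STORED, allowZip64=True
        )

    def get(self, key: str) -> bytes:
        # Read and return the contents of the member stored under `key`.
        # zipfile raises KeyError if no such member exists.
        return self._zf.read(key)

    def set(self, key: str, value: bytes) -> None:
        # Write `value` into the archive under `key`.
        # Assumption: existing keys are rejected instead of being duplicated.
        if key in self._zf.namelist():
            raise FileExistsError(f"key already present in archive: {key!r}")
        self._zf.writestr(key, value)

    def close(self) -> None:
        # Closing writes the central directory at the end of the file.
        self._zf.close()
```

Whether `set` should reject, shadow, or rewrite an existing key is exactly the kind of question the rest of this conversation debates, so the choice above should be read as a placeholder.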
Thoughts that I have revolving in my head that include:
* Each key has a name (sequence of characters) and contents
  (sequence of bytes).
Note that the keys are relative paths (not prefixed with a `/`).
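As a rough illustration of the point about relative keys, the sketch below lists the member names of an archive and flags any that start with `/`. The helper names `is_relative_key` and `absolute_keys` are hypothetical, and the only rule checked is the one stated in the comment above (no leading `/`).

```python
import io
import zipfile


def is_relative_key(name: str) -> bool:
    """Return True if `name` is non-empty and not prefixed with '/'."""
    return bool(name) and not name.startswith("/")


def absolute_keys(zf: zipfile.ZipFile) -> list:
    """Return the member names that violate the relative-path rule."""
    return [name for name in zf.namelist() if not is_relative_key(name)]


# Build a small in-memory archive with relative keys only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("zarr.json", b"{}")
    zf.writestr("arr/c/0/0", b"\x00\x01")

with zipfile.ZipFile(buf) as zf:
    offending = absolute_keys(zf)
```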
I think I have always used either Linux zip or Cygwin zip to create Zarr zip files. What native Windows program could I use to create a pure Windows zip file?
🤷
👍 |
A few downsides of adding the directory:
This commit adds a ZipStore storage backend as described in the specification zarr-developers/zarr-specs#311. Note that the implementation loads the entire zip archive into memory, so care must be taken to ensure the zip archive is not too big to fit into the machine's memory. To use a ZipStore implementation that does not load the archive into memory, see `examples/zipstore.ml`.
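The trade-off mentioned in the commit message (whole archive in memory versus reading from disk on demand) can be sketched in Python as follows. This is not the referenced OCaml implementation; the function names `open_in_memory` and `open_on_disk` are hypothetical and only the standard `zipfile` and `io` modules are used.

```python
import io
import zipfile


def open_in_memory(path: str) -> zipfile.ZipFile:
    """Read the whole archive into RAM first; memory use grows with archive size."""
    with open(path, "rb") as f:
        buffer = io.BytesIO(f.read())
    return zipfile.ZipFile(buffer, mode="r")


def open_on_disk(path: str) -> zipfile.ZipFile:
    """Open the archive by path; member contents are read only when requested."""
    return zipfile.ZipFile(path, mode="r")
```

Both handles expose the same `read(key)` call, so a store built on either one looks identical to callers; only the memory footprint differs.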
At the recent OME-NGFF Workflows Hackathon, a team has been discussing possible paths towards a "single-file" OME-NGFF standard. Our preferred path would be to build upon zipped zarrs, i.e. related to this PR. Please find below relevant points extracted from our discussions. Apologies for the long reply, happy to turn it into GitHub review/suggestion style if necessary. TL;DR:
Replies to previous comments:
In the bioimaging domain, many researchers tend to prefer individual files over file system directories when handling small to medium-sized image data (cf. TIFF), not least for practical reasons (e.g., file sharing using traditional means, double-click-open/drag-and-drop support in existing tools) and because existing tooling largely isn't ready for handling file system stores. We'd argue this applies to other domains as well. In practice, the limitations of file system stores when handling small data mean that people will archive (i.e., "zip") zarrs either way, independent of whether this is part of the specification or not. Specifying just how zarrs should be archived would enable tool developers to readily implement support for spec-compliant zarr archives, making Zarr a good choice also for their users.
We did not specifically discuss this idea. What would be the benefit of a full-fledged SFFS over archive file formats (which we'd argue are specific instances of SFFSs)? Regarding compression, Zarr itself already supports several codecs. As a side note, we instead discussed the related idea of using a single-file container format (e.g. HDF5) for a second implementation of the OME-NGFF specification (in addition to Zarr) to enable single-file images. However, this would come at the cost of significant development overhead, would eventually necessitate conversion between different "backends", and would risk fragmenting the community (particularly if there are discrepancies in interpretation), so we'd strongly prefer to stay within Zarr territory for single-file OME-NGFF (which the ZipStore would allow us to do). But, as @jhamman rightfully wrote, let's save further discussion on alternatives for another time.
We agree that, depending on the scope of the specification, this draft could (at least in part) apply more generally to any archive file format. Perhaps this could be generalized in a second step, once the ZipStore has been added? For now, we propose limiting the scope of this draft to a specific file format and endorse ZIP for the following reasons:
We too were wondering if it would make sense - in the long term - to separate the interface definition from the on-disk representation. Perhaps the interface definition could be considered an implementation detail, whereas the on-disk representation is more essential to ensure data portability? Not explicitly specifying store operations would also address compatibility issues (e.g. ZIP possibly not supporting in-place update/delete operations). More generally, with "non-file system stores" defined, we think that the current specification is missing consistent resource identifier (e.g. URI) schemes and/or alternative means (e.g., file suffix, mime type, magic number, user decision) for delineating on-disk representations/stores. This is particularly relevant in the case of OME-NGFF, where OME-Zarrs may contain multiple images and users may therefore need to specify the path to a specific image within the zip (e.g. for visualization), ideally as part of the resource identifier pointing to the zip file. However, this is not specific to the ZipStore, should in our view not be mixed with the storage specification either, and may well be an "upstream problem" for a more general specification. We thus propose to leave it up to implementations to decide what "store" to use for a given resource for now.
Having a root directory inside a zip file (with the same name as the zip file itself) can quickly become confusing/out of sync if the zip files have been renamed automatically (e.g. upon re-downloading an already existing file) and/or manually. We'd argue that not being able to unpack zip files into the same directory without first (automatically?) creating target root directories is far less confusing than ending up with directory names that may not match the zip file names (and just as in the case of no root folder, depending on tooling, one could still end up accidentally overriding "competing" root folders if they happen to have the same name). We therefore propose to NOT use root directories for archiving zarrs. Specifically, for zarr-specific zip writer implementations, we propose to REQUIRE the creation of archives without a root directory (for above reasons and consistency, also with Zarr v2). However, since zarrs may also be archived using zarr-agnostic tooling, we propose to specify that zarr reader implementations MAY additionally check for single root directories or recursively scan for
Additional remarks:
The current draft does not specify that existing keys cannot generally be overwritten (to our understanding, this is not generally possible according to the ZIP standard).
Should the draft specify the archive file format a bit more precisely, e.g., ZIP64 support (yes), support of empty or spanned zip files (no), supported compression formats (if any)? Perhaps writers should be required to support writing uncompressed ZIP64 files, whereas readers MAY support further compression algorithms?
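To make the reader/writer proposal above more concrete, here is a hedged Python sketch. `write_without_root` writes keys directly at the archive root as an uncompressed, ZIP64-capable archive, and `single_root_prefix` is one possible way a reader could detect that a zarr-agnostic tool wrapped everything in a single top-level directory. Both function names are hypothetical and the detection rule is an assumption, not wording from the draft.

```python
import zipfile
from typing import Dict, Optional


def write_without_root(path: str, entries: Dict[str, bytes]) -> None:
    """Write keys at the archive root, uncompressed, with ZIP64 allowed."""
    with zipfile.ZipFile(
        path, mode="w", compression=zipfile.ZIP_STORED, allowZip64=True
    ) as zf:
        for key, value in entries.items():
            zf.writestr(key, value)


def single_root_prefix(zf: zipfile.ZipFile) -> Optional[str]:
    """Return 'name/' if every member lives under one top-level directory."""
    names = [n for n in zf.namelist() if n]
    if not names:
        return None
    # A member without any '/' is a file at the archive root, so no wrapper exists.
    if any("/" not in n for n in names):
        return None
    tops = {n.split("/", 1)[0] for n in names}
    if len(tops) != 1:
        return None
    return next(iter(tops)) + "/"
```

A reader that finds a prefix could strip it from every member name before treating the names as keys; a reader that finds none would use the member names unchanged.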
Store limitations
=================

The following limitations for this store are know:
The following limitations for this store are know:
The following limitations for this store are known:
Thanks for the writeup @jwindhager et al. It might be clarifying to factor out the following two discussions, as I think they are logically separable:
I think question 2 is the kind of thing that really ought to be explored alongside at least one implementation. So my pitch is as follows: we implement an opinionated zarr archive function in Would anyone here be interested in working on such an effort? I do think this requires at least one champion to push forward. As a |
I'm in favor of supporting archive formats and zip. Tensorstore already supports reading but not writing. Zip has some disadvantages in its design but I think they are outweighed by it being such a common format. I agree that there should be no implicit root directory, and while some implementations may do auto-discovery, there should be a canonical url that makes any sub-directories within the zip file explicit. The spec says the canonical url is just a file url, file:///path/to/file.zip. While that is reasonable for implementations that do auto-discovery, I don't think that is a good idea as the canonical url since it does not explicitly indicate the zip format at all, and would rely on implementations detecting it either by the filename or content. Previously I proposed a different url syntax (zarr-developers/zeps#48) which allows nested formats like zip to be specified explicitly. |
IMO zip is not a good single file storage format. Other choices like the various |
This might be true but I think it's orthogonal to the discussion at hand -- we are not trying to find the best archive file format, but rather devise standards to improve the utility of a popular archive file format (zip). Zip being sub-par doesn't bear on the fact that people want to use it, and that latter fact is what we should build around IMO.
I think it should be left to the implementation to decide whether to support write operations. Some people might want to rename/overwrite/delete entries in a ZIP as a convenience, just like in any other store. Sure, the ZIP standard does not support this, but there are ways to work around this limitation (although quite inefficient). For example, I have 2 ZipStore implementations that support the full zarr v3 abstract store interface as defined in the core spec. I don't see the benefit of imposing this limitation on implementations.
Agreed, and in fact I think this spec could be made much more concise. I don't think it is necessary to list the supported operations.
@d-v-b I agree that we could factor out the two discussions (this is what we meant with "separate the interface definition from the on-disk representation") and I would also add the "canonical url" proposal by @jbms to the list (this is roughly what we referred to as "consistent resource identifier schemes"). However, given the way stores are specified in the current spec, I'd pragmatically argue that it's probably easier to get this PR merged in its current form rather than to further broaden the scope / branch out. I also agree that we could push the on-disk representation aspect on the @jbms Thanks a lot for pointing us to your ZEP! We somehow completely missed it in our group's discussions. I think this could address many important issues, also related to zipped Zarr (and thus single-file OME-NGFF), and will give this a read asap.
Fair point. I still think that one could clarify this under "Store limitations" using the right phraseology, but no strong feelings. |
We can iterate quickly in |
Related discussion here: #Zarr > zipstorers |
In addition to not supporting deletion, zip storage would also have the following issues with
The current implementation of ZipStorage in zarr-python seems to hit the first issue when attempting to append (tries to re-write zarr.json, error due to an existing file of the same name):
Should the spec specify that arrays in zip stores are write-once only, so in addition to not being erasable, they are not resizeable or append-able? I can see some ways to work around the issues and support appends, but they might be inefficient, very complex and/or leaky abstractions that make the storage not entirely opaque to code users. BTW, I've written a comment at #209, since it seems this issue is not the place to discuss alternatives to zip, but I think sqlite would be a good candidate for single-file zarr stores that support updates, appends and deletions.
I am far from a zip expert but my understanding is that you can modify the contents of a zip archive, at the price of increasing its size. This means one can append to zipped zarr data, e.g., by adding new arrays or groups, but one could also modify existing objects in the archive. Evidently we don't support this in zarr python but that seems like a problem with our implementation, not the zip format.
I am also not an expert! But indeed, from my understanding you could remove the existing file from the index (footer, in the file structure), append a new file data block, and write a new footer which references the new (overwritten) file, but not the original version. In this way, you end with a block of data somewhere in the file which is unused, this would fall under "inefficient" as mentioned in my previous comment. You could "erase" any file in the same way. I guess there could be a way for implementations to track these unused blocks and reuse them when possible for additions that are smaller or equal in size to the block. Sort of like what a filesystem or memory allocator does, but this can become pretty complex. I mostly wonder if the choice to support these operations or not should be left up to the implementation, or forced by the spec. One way to see it, is that it doesn't affect readers. Whatever the writer has done (left unused blocks from deleted or overwritten files, or not), it is transparent to any eventual reader implementation. I think this is an argument in favor of letting the implementation decide if they support erasing or resizing arrays. |
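A small experiment with Python's standard `zipfile` module illustrates the "append a new block and rewrite the footer" idea and its cost. Note that this is a variation on the scheme described above: `zipfile` in append mode does not drop the old entry from the central directory. It warns about the duplicate name, keeps both entries, and resolves lookups by name to the entry written last. The function name `overwrite_by_append` is hypothetical.

```python
import io
import warnings
import zipfile


def overwrite_by_append(first: bytes, second: bytes):
    """Write one key twice and report what a later reader observes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("zarr.json", first)
    size_before = len(buf.getvalue())

    # Re-open in append mode and write the same name again.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with zipfile.ZipFile(buf, "a") as zf:
            zf.writestr("zarr.json", second)
    size_after = len(buf.getvalue())

    with zipfile.ZipFile(buf, "r") as zf:
        names = zf.namelist()          # both entries are still listed
        visible = zf.read("zarr.json")  # lookup by name yields the later entry
    return names, visible, size_before, size_after, len(caught)
```

The old bytes stay in the file as dead space, which matches the "inefficient" caveat: readers see the new contents, but the archive only ever grows unless it is rewritten from scratch.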
Another thought on this topic. The Store API, according to the zarr spec, currently defines 3 optional "properties" of a store:
Perhaps it would be relevant to add an additional property named "editable" or, conversely, "append-only"? An append-only store would support
I feel what you're describing is just an implementation detail and should really be up to package maintainers to decide (as long as it is communicated in the documentation). Adding an "editable" store classification brings nothing new and is already covered by the 3 store classes.
This is a working draft of the v3 ZIP file store specification.
xref: