Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(stores): draft zip file store specification #311

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Sep 11, 2024

This is a working draft of the v3 ZIP file store specification.

xref:

Comment on lines +88 to +90
* Delete a file.

* Delete a directory.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #103

@joshmoore
Copy link
Member

In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right, e.g.,

@joshmoore
Copy link
Member

cc: @DennisHeimbigner

@zoj613
Copy link
Contributor

zoj613 commented Sep 11, 2024

How useful is a ZipStore in practice? Are there a lot of use cases for it? Given how limited it is (no rename/deletion, etc) I am wondering if its worth having a spec for it

@DennisHeimbigner
Copy link

I have support equivalent to zipstore in nczarr in the netcdf-c library. I agree that it does not appear to be
very useful, but the basic idea behind it is reasonable: a single file containing a complete zarr file tree,
and using compression the component files to save space.
Personally, I think that using a single file file system (SFFS) with added compression makes more sense.
There are several implementations available, and it is easy enough to write your own,

@jhamman
Copy link
Member Author

jhamman commented Sep 12, 2024

In my experience, the root of the zip is one of the trickiest parts for data creators (and I assume implementers) to get right...

@joshmoore - do you have suggestions for the spec document that would make this clearer?


@zoj613 and @DennisHeimbigner - let's try to avoid making this about alternatives to the ZIP store concept. There are practical reasons to add this (Zarr-Python has long supported a ZIP store interface).

Remember, Zarr can support many storage backends. If there are alternatives to experiment with, let's do that in a separate issue.


@DennisHeimbigner - I would like to get your feedback on the spec as written. Is it aligned with your netcdf-c implementation?

Comment on lines +107 to +111
* ``get(key) -> value`` : Read and return the contents of the object at
within the archive at path ``key``.

* ``set(key, value)`` : Write ``value`` as the contents of the file at
into the archive at path ``key .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the use of at within and at into in these lines intentional? Sounds like a typo

@joshmoore
Copy link
Member

@joshmoore - do you have suggestions for the spec document that would make this clearer?

Thoughts that I have revolving in my head that include:

  • is it really a store spec or is it a format spec?
  • for the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows)
  • for v2, I fully see getting this written down ASAP; for v3, I wonder if a general "archival" format that can be extended for, say, zip/tar/whatever wouldn't be something to consider

Comment on lines +63 to +64
* Each key has a name (sequence of characters) and contents
(sequence of bytes).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the keys are relative paths (not prefixed with a /).

@DennisHeimbigner
Copy link

for the format, the most important item I know of is "don't include the top-level directory" (though I have run into some complaints about that from various repositories, since the behavior differs between implementations, e.g. on Windows)

I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file?
As for the top-level directory, I think it is better to always include it. I say this so that my rule holds, namely:
1.unzipping a zip store creates a directory tree usable by the zarr directory tree storage manager.
2. zipping a zarr directory tree creates a zip store conforming to the proposed zip spec.

@jhamman
Copy link
Member Author

jhamman commented Oct 7, 2024

I think I have always used either linux zip or cygwin zip to create zarr zip files. What native windows program could I use to create a pure windows zip file?

🤷

As for the top-level directory, I think it is better to always include it. I say this so that my rule holds, namely:
1.unzipping a zip store creates a directory tree usable by the zarr directory tree storage manager.
2. zipping a zarr directory tree creates a zip store conforming to the proposed zip spec.

👍

@joshmoore
Copy link
Member

A few downsides of adding the directory:

  • It's a change from v2 😄 though that can be handled.
  • You must know/lookup that value before you open the zip.
  • What is the behavior if there are multiple .zarr directories within the zip?

zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
zoj613 added a commit to zoj613/zarr-ml that referenced this pull request Nov 6, 2024
This commit adds a ZipStore storage backend as described in the
specification zarr-developers/zarr-specs#311 .
Note that the implementation loads the entire zip archive into memory so
care must be taken to ensure the zip archive is not too big to fit into
the machine's memory. To use a ZipStore impelementation that does not
load the archive into memory see `examples/zipstore.ml`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants