Partial chunk reads #59
Thank you for the mention! I would still be extremely interested in this feature. I was trying to move my datasets (4D volumetric imaging microscopy) to Zarr, but I stopped mainly because of this problem. I constantly need to combine chunks of a size that makes sense for our parallelized data-processing pipelines with the option of loading small parts of those chunks, plus fast visualisation with a viewer that slices them, grabbing only the frames that have to be displayed.
Thanks @jrbourbeau for reviving this. Just a short technical note: there are at least two possible scenarios here when a compressor is involved.
Scenario 1 could be achieved for some compression codecs, and would require a change to the codec interface to allow leveraging mechanisms such as Blosc's `blosc_getitem`. It isn't obvious to me yet whether scenario 2 can be achieved at all; it is technically quite complex. If it is doable, it would require changes both to the codec interface and the storage interface. Scenario 3 is doable and would require changes only to the storage interface.
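For the storage-interface-only scenario, the key observation is that an uncompressed C-order chunk lets you compute the exact byte range of any element, so a store that supports byte-range reads can serve a single item without fetching the whole chunk. A minimal sketch (the helper name and signature are illustrative, not zarr API):

```python
import numpy as np

def item_byte_range(index, chunk_shape, dtype):
    """Byte range of one element inside an uncompressed, C-order chunk."""
    flat = np.ravel_multi_index(index, chunk_shape)  # row-major flat index
    itemsize = np.dtype(dtype).itemsize
    start = flat * itemsize
    return start, start + itemsize

# Element (1, 2) of a 4x4 float64 chunk: flat index 6, so bytes 48..56.
start, stop = item_byte_range((1, 2), (4, 4), "f8")
```

A compressed chunk breaks this arithmetic, which is why the other scenarios need codec cooperation.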
A fourth scenario (which I'm very interested in) might be:
Does Zarr already work like this for data in cloud storage buckets and for data on a local POSIX filesystem?
As I understand it, Zarr currently does not support any form of partial chunk reads. But indeed, perhaps it should! One promising way to implement this would be to wrap Caterva inside Zarr: zarr-developers/zarr-python#713
Zarr does support partial chunk reads! It was implemented by @andrewfulton9 in zarr-developers/zarr-python#667 for data encoded with Blosc!
The v3 spec defines partial chunk reads and writes, though it does not yet discuss interactions with codecs: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#abstract-store-interface
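The kind of abstract store interface the v3 spec describes can be sketched in Python roughly as follows. The method names and the in-memory store here are illustrative assumptions, not the spec's exact wording:

```python
from typing import Protocol

class ReadableStore(Protocol):
    """Store whose keys map to byte strings, with optional byte-range reads."""
    def get(self, key: str) -> bytes: ...
    def get_partial(self, key: str, start: int, length: int) -> bytes: ...

class MemoryStore:
    """Toy in-memory store implementing the protocol above."""
    def __init__(self):
        self._data: dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def get_partial(self, key: str, start: int, length: int) -> bytes:
        # Only the requested byte range is returned; a cloud store would
        # translate this into an HTTP Range request.
        return self._data[key][start:start + length]

store = MemoryStore()
store.set("data/arr/c0/0", bytes(range(16)))
part = store.get_partial("data/arr/c0/0", 4, 4)  # bytes 4..8 only
```

On object stores this maps naturally onto HTTP Range requests, and on POSIX filesystems onto `seek` + `read`, which is why the spec can define it at the abstract store level.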
The ability for zarr to support partial chunk reads has come up a couple of times (xref zarr-developers/zarr-python#40, zarr-developers/zarr-python#521). One benefit of supporting this would be improvements to slicing operations that are poorly aligned with chunk boundaries. As @alimanfoo pointed out, some compressors also support partial decompression, which would allow for extracting out part of a compressed chunk (e.g. the `blosc_getitem` method in Blosc).

One potential starting point would be to add a new method, e.g. `decode_part`, to the `Codec` interface. Compressors which don't support partial decompression could have a fallback implementation where the entire chunk is decompressed and then sliced. We would also need a mechanism for mapping chunk indices to the appropriate parameters needed for `decode_part` to extract a part of a chunk.

With the current work on the v3.0 spec taking place, I wanted to open this issue to discuss if partial chunk reads are something we'd like to support as a community.