zarr pointer to existing files #631
Just chiming in as a Zarr user who's experimented a bit with adapting existing image formats to zarr. I think this can best be accomplished by creating a custom store that maps Zarr keys to individual images (files). In this case, the store needs to provide some additional metadata (.zarray, .zgroup, etc.) that describes the hierarchy for the images that you desire. If they are JPEG/PNG, the store will need to take care of decoding/encoding the images. I have actually prototyped something quite similar to this to view Deep Zoom Image (DZI) pyramids as Zarr. DZI is a format for image pyramids where each level of a pyramid is a directory of JPEG/PNG "tiles". Here is the custom store implementation that maps the DZI format to the multiscale zarr specification: https://github.com/manzt/napari-dzi-zarr/blob/master/napari_dzi_zarr/store.py
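For a sense of what such a custom store involves, here is a minimal read-only sketch (not the DZI store linked above; the class name, the chunk-key-to-file mapping, and the use of imageio for decoding are all assumptions) that presents a handful of same-shape grayscale PNGs as a single zarr array:

```python
import json
from collections.abc import MutableMapping

import imageio  # assumed here for PNG decoding

class PNGStore(MutableMapping):
    """Hypothetical read-only store: same-shape PNGs as chunks of one zarr array."""

    def __init__(self, files, shape, chunks, dtype="|u1"):
        # files maps zarr chunk keys to image paths,
        # e.g. {"0.0.0": "img-1.png", "1.0.0": "img-2.png"}
        self._files = files
        self._meta = json.dumps({
            "zarr_format": 2,
            "shape": shape,
            "chunks": chunks,
            "dtype": dtype,
            "compressor": None,  # the store decodes PNG itself, so zarr sees raw bytes
            "fill_value": 0,
            "order": "C",
            "filters": None,
        }).encode()

    def __getitem__(self, key):
        if key == ".zarray":
            return self._meta
        # decode the PNG on demand and hand zarr the raw chunk bytes
        return imageio.imread(self._files[key]).tobytes()

    def __iter__(self):
        yield ".zarray"
        yield from self._files

    def __len__(self):
        return 1 + len(self._files)

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")
```

Opening it would then be `zarr.Array(PNGStore(...))`: zarr asks the store for `.zarray` and the chunk keys, and the store decodes each PNG lazily.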
This is closely related to #556.
@rabernat my understanding here is that there are many separate files (JPEGs) and the desire is to map those files as Zarr array chunks within a hierarchy. In #556, there is a single binary container and the mapping is to byte ranges within the container. A key difference as well is the need for the store to perform decoding/encoding.
I see your point. In my mind, what they have in common is the desire to "wrap" an existing storage scheme with Zarr.
Might also find this rough spec extension idea (zarr-developers/zarr-specs#82) of interest 😉
Thanks, @manzt, I think your interpretation is correct (I was actually just looking at your napari project), and zarr-specs#82 indeed seems like it would address my use case... if I understood it correctly; it's a very technical explanation. I will keep 👀 on that and explore the napari-dzi-zarr store.
@dschneiderch: we have been working on a proposed specification to map collections of binary files to zarr chunks in https://github.com/intake/fsspec-reference-maker. It would be great to get your feedback on whether that would meet your use case. |
As it's recently been merged, the v1 spec should allow you to explicitly map your image files to the zarr data model.
There is an open PR to add v1 support to the ReferenceFileSystem implementation. One option is to register a custom codec for zarr to use when accessing array chunks. This means you'll need to specify a codec in the `.zarray` metadata so zarr knows how to decode each chunk.
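A toy sketch of that codec-registration route (the codec id "png" and decoding via imageio are assumptions here; the imagecodecs package discussed further down ships a ready-made PNG codec):

```python
import io

import imageio
import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec

class PNG(Codec):
    """Toy read-only codec that decodes PNG-encoded chunk bytes."""

    codec_id = "png"  # referenced by the "compressor" field in .zarray metadata

    def encode(self, buf):
        raise NotImplementedError("read-only use case")

    def decode(self, buf, out=None):
        # buf holds the raw PNG file contents for one chunk
        # (handling of a preallocated `out` buffer is omitted in this sketch)
        arr = imageio.imread(io.BytesIO(bytes(buf)))
        return np.ascontiguousarray(arr)

register_codec(PNG)  # zarr looks codecs up by codec_id when reading
```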
Great to see this! Sorry I didn't chime in earlier. I tried the v1 spec. I also tried to use the JSON directly in a reference filesystem, and then based on the test in the PR, but that gives an error. I have a basic folder structure; I still need to figure out the encoding/decoding part for PNG.
Realizing my response likely confused more than helped. The short answer is that the ReferenceFileSystem provides a formal specification to express the idea of a "zarr pointer to existing files". The issue is that you need to write some code to generate this description (in JSON), effectively translating your custom directory structure to the Zarr data model. Let me elaborate. Using your directory of PNGs as an example, we can think of each PNG as a compressed Zarr Array "chunk". It is up to you how you want to organize these "chunks" in a Zarr hierarchy. You could treat each chunk as an individual Zarr Array, or you could lay the chunks out within a single multi-dimensional Zarr Array. The latter is likely how you'd like to use Zarr, but this is only possible if each PNG "chunk" is the same shape. With your two PNGs, we can think of a theoretical Zarr Array having the following attributes (dimensions inferred from the script below):

- shape: (2, height, width), stacking both images along a new first axis
- chunks: (1, height, width), one chunk per image
- dtype: |u1 (uint8)
In Zarr, this "Array" is written to a store with the following keys:

```
.
└── data.zarr/
    ├── .zarray  # array metadata (JSON)
    ├── 0.0.0    # A1-doi-20200531T210155-PSII0-1.png
    └── 1.0.0    # A1-doi-20200531T210155-PSII0-2.png
```

The issue is that you don't want to rename your files, which is where the reference spec comes in: you 1.) create the missing `.zarray` metadata, and 2.) map the chunk keys (`0.0.0`, `1.0.0`) to your existing PNGs.
Therefore, the reference description would look something like:
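A sketch of the v1 reference spec for this layout (the `.zarray` string is abbreviated here, and the image dimensions shown are placeholders; the script below generates the real thing):

```json
{
  "version": 1,
  "templates": {
    "path": "file://data/psII/dataset-A1-20200531"
  },
  "refs": {
    ".zarray": "{\"zarr_format\": 2, \"shape\": [2, 480, 640], \"chunks\": [1, 480, 640], ...}",
    "0.0.0": ["{{ path }}/A1-doi-20200531T210155-PSII0-1.png"],
    "1.0.0": ["{{ path }}/A1-doi-20200531T210155-PSII0-2.png"]
  }
}
```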
See how the chunk keys simply point at your existing files. This Zarr Array metadata can look tedious to write, but Zarr actually has some utilities to write this. I would personally create this reference in Python using the following:

```python
# write_reference.py
import json

import imageio
from zarr.storage import init_array

example_chunk = imageio.imread('A1-doi-20200531T210155-PSII0-1.png')

refs = dict()

# writes ".zarray" to refs
init_array(
    refs,
    shape=(2,) + example_chunk.shape,
    chunks=(1,) + example_chunk.shape,
    dtype="|u1",
    compressor=None,  # ignoring compression for now
)
refs[".zarray"] = refs[".zarray"].decode()  # decode bytes as a python string
refs["0.0.0"] = ["{{ path }}/A1-doi-20200531T210155-PSII0-1.png"]
refs["1.0.0"] = ["{{ path }}/A1-doi-20200531T210155-PSII0-2.png"]

spec = dict(
    version=1,
    templates=dict(path="file://data/psII/dataset-A1-20200531"),
    refs=refs,
)

with open('reference.json', mode='w') as fh:
    fh.write(json.dumps(spec))
```
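To read the array back through this reference, something like the following should work (a sketch; fsspec exposes the ReferenceFileSystem under the "reference" protocol with the spec passed as `fo`):

```python
# open_reference.py
import fsspec
import zarr

fs = fsspec.filesystem("reference", fo="reference.json")
arr = zarr.open_array(fs.get_mapper(""), mode="r")
print(arr.shape, arr.dtype)  # (2, height, width) |u1
```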
Unfortunately, I think this is likely a current limitation of the ReferenceFileSystem implementation.

A note on compression

Finally, I haven't addressed the issue of the "chunks" being encoded as PNG. By default, Zarr uses various codecs from a library called numcodecs, but the imagecodecs package provides additional image codecs (including PNG) that plug into zarr. I haven't tried this with PNG (but I have with JPEG). When writing the `.zarray` metadata, you can specify the PNG codec as the compressor:
```python
# write_reference.py (continuing from the script above)
from imagecodecs.numcodecs import Png

refs = {}

# writes ".zarray" to refs
init_array(
    refs,
    shape=(2,) + example_chunk.shape,
    chunks=(1,) + example_chunk.shape,
    dtype="|u1",
    compressor=Png(),  # writes { "id": "imagecodecs_png" }: tells the zarr client to use the imagecodecs_png codec to decode each chunk
)
```

Then, when you use Zarr, you'll need to run a function from `imagecodecs.numcodecs` to register the codecs:

```python
# read_reference.py
import zarr
from imagecodecs.numcodecs import register_codecs

register_codecs()  # adds all image codecs to the zarr registry
# ... then use zarr as usual; chunks are decoded with imagecodecs_png
```

I hope this comment adds some clarity to how to use the ReferenceFileSystem in your situation. One potential issue I see is that the reference file system is read-only at the moment, so if you need the ability to write chunks with different key names, that won't be possible.
Note on this: being async-only is a current limitation of ReferenceFileSystem, and so it only works with HTTP, S3, GCS and Azure. Making it also work with local files is totally doable, but (in my opinion) less useful. It's having to download only small parts of potentially massive remote data, and getting parallel access to archives, that are the bigger wins for fsspec-reference-maker.
OK, thanks for the thorough explanation! However, none of that will work for me without an implementation of ReferenceFileSystem for local files, right? Our use is primarily to package the stack of image files as a single object, so read-only would be OK. However, we need it to work locally. Most processing is still happening locally, but I was hoping this would allow us to scale up to remote stores too. plantcv only works with a single input; that is, given a list of image files, it handles parallel processing for each file via dask. We have cases where groups of images need to be kept together, though, so we thought we could use zarr to preprocess the stack and create new "files" containing groups of images without duplicating the data.
The ZarrFileStore does not currently allow exporting an fsspec ReferenceFileSystem. Supporting local files in the ReferenceFileSystem would be very useful, also for development and testing.
Using that works. But I saved it without converting to zarr first; is there a way to save the store directly? FWIW, I would be interested in loading this with xarray, with labels for each PNG. Alternatively, I could load the PNGs into xarray and then save to disk as netCDF (or zarr, I guess), but I was trying to avoid duplicating the data.
conda-forge should work. This will read the images from the whole file sequence into memory and then save them as a separate, writable zarr array.
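A rough sketch of that conversion (the glob pattern, the output path, and decoding PNGs via `imagecodecs.imread` are assumptions):

```python
import imagecodecs
import tifffile
import zarr

# read every PNG in the sequence into one in-memory array
# (assumes all images share the same shape and sort into the desired order)
with tifffile.TiffSequence('data/psII/*.png', imread=imagecodecs.imread) as seq:
    data = seq.asarray()  # shape: (num_images, height, width[, channels])

# save as a separate, writable zarr array (this duplicates the pixel data)
zarr.save('data.zarr', data)
```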
If you are interested in organizing your files into a higher-dimensional zarr array, TiffSequence takes an optional regular expression pattern that matches axes and sequence indices in the file names. That can get quite complicated: https://github.com/cgohlke/tifffile/blob/581d7a5d4d7784154066b9f11a0167bc08570b7c/tests/test_tifffile.py#L12686-L12729
Fair enough, I'll look into non-async soon. It ought not to be too hard.
Closing now that tifffile and fsspec support this use case! |
Problem description
I'm looking for a solution that uses existing image files in a more structured manner. I am exploring a couple of options and would like to avoid duplicating data. In phenomics, we generate time series of images and then process them using computer vision techniques. I end up with reasonably "large" datasets (~10-30 GB), and we want to maintain the ability to view the images in the filesystem. The images are stored in a database during an experiment, and then we pull them to the filesystem for viewing and processing.
Please correct me if this is wrong, but I don't see a method to save a zarr without writing the binary chunks to the store.
zarr seems like it could naturally extend its format by including the path to the actual image (an array called image.png) in the .zarray file that lives in, e.g., image.zarr. In my use case the chunk size is naturally just the image shape. If I understand zarr correctly, one of the benefits here would be the ability to use the hierarchy/grouping functionality, since some images come in groups. For example:
On day=1, I have a 300 sec time series with 2 images (frame) approximately every 20 secs (parameter). So it would be great to have a way to group images by parameter and then by frame and easily call them by relationship. I have this iteration of images for 10 consecutive days and for 40 different plant barcodes (a hypothetical hierarchy is sketched below).
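For illustration, a hypothetical hierarchy along those lines (group names, image size, and dtype are invented placeholders; note this sketch writes real chunks, unlike the pointer idea):

```python
import numpy as np
import zarr

# hypothetical layout: barcode / day / parameter -> stack of frames
root = zarr.open_group('experiment.zarr', mode='w')
frames = root.create_dataset(
    'A1/day01/param00',
    shape=(2, 480, 640),   # (frame, height, width); placeholder image size
    chunks=(1, 480, 640),  # one chunk per image
    dtype='u1',
)
frames[0] = np.zeros((480, 640), dtype='u1')  # would be the first image's pixels
```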
Other options I'm considering are making an xarray dataset or using metadata text files to virtually group the images (inspired by GDAL VRT).
Maybe this is something for spec v3, but I'm open to suggestions. Tagging collaborator @nfahlgren.
Thanks!