Add support for .zarr "files" upload #852
@dchiquito Questions on the zarr upload API:
use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have zarr extension. in fact ngff uses .ngff extension. also ngff validator not in place at the moment, so zarr is the closest. i've posted this issue for seeking additional clarity: zarr-developers/zarr-python#912
@dchiquito I'm trying to upload a trivial test zarr with the following code:

```python
#!/usr/bin/env python3
import json
import os
from pathlib import Path
import sys

from dandi.dandiapi import DandiAPIClient, RESTFullAPIClient
from dandi.support.digests import get_dandietag
from dandi.utils import find_files

dandiset_id = sys.argv[1]
zarrdir = Path(sys.argv[2])
if zarrdir.suffix != ".zarr" or not zarrdir.is_dir():
    sys.exit(f"{zarrdir} is not a zarr directory")

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": get_dandietag(zf).as_str(),
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False)
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
    #print(json.dumps(r, indent=4))
    d = client.get_dandiset(dandiset_id, "draft", lazy=False)
    r = client.post(
        f"{d.version_api_path}assets/",
        json={"metadata": {"path": zarrdir.name}, "zarr_id": zarr_id},
    )
    print(json.dumps(r, indent=4))
```

but I'm getting a 403 response when PUTting the files to S3:
Then how do we tell whether something is a zarr directory or not?
open it as a NestedDirectoryStore in read mode with the zarr-python library. it should be able to give you information about groups and shapes, and metadata. we would essentially validate based on those and for the moment write our own requirements for the ngff files till the ome-zarr-py library implements those. but since this is generic zarr to start with, i would use the xarray or zarr python libraries to read the dataset.
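A minimal sketch of that check, assuming zarr 2.x (as in the pip output further down); the broad exception handling is illustrative rather than an enumeration of what zarr-python actually raises:

```python
import zarr

def looks_like_zarr(path: str) -> bool:
    """Try to open `path` as a nested directory store in read mode."""
    store = zarr.NestedDirectoryStore(path)
    try:
        z = zarr.open(store, mode="r")  # a Group or Array on success
    except Exception:  # e.g. zarr.errors.PathNotFoundError for non-zarr directories
        return False
    # Groups expose z.tree() and z.attrs; arrays expose z.shape, z.chunks, z.dtype --
    # enough to hang basic validation and metadata extraction on.
    return True
```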
@satra So we just try to open every single directory with the zarr library and see if it succeeds?
Thus I think we should
edit 1: I think it is ok to add -- it adds only 2 dependencies, asciitree and numcodecs (this one seems a bit heavy though), and zarr is in conda-forge:

```
$> pip install zarr
Collecting zarr
  Using cached zarr-2.10.3-py3-none-any.whl (146 kB)
Requirement already satisfied: numpy>=1.7 in ./venvs/dev3/lib/python3.9/site-packages (from zarr) (1.21.4)
Requirement already satisfied: fasteners in ./venvs/dev3/lib/python3.9/site-packages (from zarr) (0.16.3)
Collecting asciitree
  Using cached asciitree-0.3.3-py3-none-any.whl
Collecting numcodecs>=0.6.4
  Using cached numcodecs-0.9.1-cp39-cp39-manylinux2010_x86_64.whl (6.4 MB)
Requirement already satisfied: six in ./venvs/dev3/lib/python3.9/site-packages (from fasteners->zarr) (1.16.0)
Installing collected packages: numcodecs, asciitree, zarr
Successfully installed asciitree-0.3.3 numcodecs-0.9.1 zarr-2.10.3
```
i think for the moment this would be fine.
zarr will also require compression codecs to be added.
@yarikoptic @satra Note that zarr seems to ignore files with invalid/unexpected names, and a
for the moment, let's say no empty groups allowed.
FWIW, pretty much we need
if per above we just add
@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr extension as a Zarr? Also, how should metadata be determined for Zarr assets? (cc @satra)
Yes
for the moment i would limit metadata extraction similar to the bids data, so based on names rather than internal metadata. in the future once we get better ngff metadata we will write additional extractors. i can help with the metadata extraction beyond basic metadata (size, encoding type, subject id). for dandi this would be part of bids datasets, so we will have additional info available for sticking into the asset metadata from participants.tsv and samples.tsv files.
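A rough sketch of what name-based basic metadata could look like; the field names, the application/x-zarr encoding value, and the sub-&lt;label&gt; pattern are placeholders for illustration, not the actual schema:

```python
import re
from pathlib import Path

def basic_zarr_metadata(zarrdir: Path) -> dict:
    """Derive minimal metadata for a zarr directory from paths and sizes alone."""
    files = [p for p in zarrdir.rglob("*") if p.is_file()]
    m = re.search(r"sub-([A-Za-z0-9]+)", zarrdir.as_posix())  # BIDS-style subject ID, if any
    return {
        "path": zarrdir.name,
        "contentSize": sum(p.stat().st_size for p in files),
        "encodingFormat": "application/x-zarr",  # placeholder value
        "subject_id": m.group(1) if m else None,
    }
```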
@yarikoptic FYI: I'm implementing the metadata & validation for Zarr by giving NWB, Zarr, and generic files their own classes with metadata and validation methods; however, fscacher doesn't currently support caching of instance methods, so some caching is going to have to be disabled for now.
hm, do you see how fscacher could gain support for bound methods? if not, I wonder if we shouldn't just concentrate logic in @staticmethods of such classes which would be explicitly passed a path instead of an instance?
@yarikoptic We would have to add a variant of
right -- sounds good! somehow it didn't occur to me ;)
Currently there is no limit. Now that we have the code up on a real server I need to do some experimentation to determine that value. The limit is mostly there to enforce that no request takes longer than 30 seconds, since Heroku will forcibly cancel any requests that exceed that timeout.
The client would need to calculate the checksum to do integrity checking of the zarr upload. I wrote https://github.com/dandi/dandi-api/blob/master/dandiapi/api/zarr_checksums.py to handle all of that logic, and I put it in dandi-api for now for quicker iteration. It should be moved to a common location soon so that the CLI can use it as well.
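The exact format lives in zarr_checksums.py and should be taken from there; purely as an illustration of the idea (folding per-file MD5 ETags into one digest for the whole tree), an aggregator might look like this:

```python
from hashlib import md5

def zarr_tree_digest(file_md5s: dict) -> str:
    """Illustration only: combine per-file MD5s, keyed by path relative to the zarr
    root, by hashing a sorted listing. Empty directories contribute nothing, so they
    are effectively ignored."""
    listing = "\n".join(f"{path}:{digest}" for path, digest in sorted(file_md5s.items()))
    return md5(listing.encode()).hexdigest()
```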
That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?
I would assume that would be the name of the directory containing the zarr data, unless zarr metadata contains a more descriptive name.
I would say we are using simple MD5, and that for these purposes MD5 and ETag are the same thing. Files in the zarr archive are limited to 5GB so that they can use the simple upload endpoint, and for files uploaded simply (as opposed to multipart), their ETag is the MD5 of the file. DANDI Etags are defined to be the ETags of multipart uploads, which are MD5s with a suffix relating to the number of parts.
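So for the sub-5GB files inside a zarr, the client-side value is just a streaming MD5 of the file contents, e.g.:

```python
from hashlib import md5

def single_part_etag(path) -> str:
    """ETag of a simple (non-multipart) S3 upload: the hex MD5 of the file contents."""
    h = md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()
```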
Correct. Multipart upload requires substantially more engineering, so the executive decision was made to cap zarr component files at 5GB. Zarr is relatively easy to simply rechunk smaller if that is exceeded.
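For reference, rechunking with plain zarr can be as simple as the sketch below; the paths and chunk sizes are made up, it assumes the store holds a single array, and for large arrays you would copy block-wise or use something like the rechunker package instead of materializing everything in memory:

```python
import zarr

src = zarr.open(zarr.NestedDirectoryStore("data.zarr"), mode="r")
dst = zarr.open(
    zarr.NestedDirectoryStore("data_rechunked.zarr"),
    mode="w",
    shape=src.shape,
    chunks=(256, 256),       # smaller chunks => smaller per-chunk files on S3
    dtype=src.dtype,
)
dst[:] = src[:]              # fine for modest arrays; copy block-wise for big ones
dst.attrs.update(src.attrs)  # carry the array attributes over
```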
Sorry, my bad. dandi/dandi-archive#648
That is done directly from S3 using the
this should be done as is done currently by the CLI for other types of files. the exact details of encodingformat can be worked out in the relevant CLI PR.
@jwodder I confirm that I am encountering the same error with your script. I will look closer next week.
@jwodder I got the script working against
@jwodder I missed a spot with AWS permissions, the staging API has been updated. You need to specify the X-Amz-ACL: bucket-owner-full-control header when PUTting the files to S3. My adaptation of your example looks like this:

```python
from hashlib import md5  # in addition to the imports from the script above

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        with open(zf, "rb") as f:
            blob = f.read()
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": md5(blob).hexdigest(),  # Simple MD5 instead of dandi-etag
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(
                    upspec["upload_url"],
                    data=fp,
                    json_resp=False,
                    headers={"X-Amz-ACL": "bucket-owner-full-control"},  # required header
                )
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
```
Could you elaborate on how this "integrity checking" fits into the upload flow? Exactly what does the client compare its computed value against?
@dchiquito could you please clarify the above question from @jwodder? In the example you posted above I do not see zarr "file" integrity checking anywhere. Is providing it optional?
@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location? What exactly should pyout display when downloading a multifile Zarr?
correct
edit: with an overall checksum comparison across the tree... if possible to make it "on the fly" as we do with the digest for an individual file, but traversing the same order as desired for computation of the zarr checksum, that would be awesome!
I think for user reporting (pyout) we could consider a zarr file (well -- directory) to be a single asset/file. So % progress would be based on the "total" for that directory. Eventually we might want to return/expose some progress in # of files within the zarr, but I think there is no immediate need for that.
@yarikoptic @dchiquito What should the Zarr upload do if a Zarr contains an empty directory? There doesn't seem to be a way to inform the API of such a directory's existence, yet the directory is still used in calculating the checksum, and so there will be a checksum mismatch after uploading.
I don't think well-formed zarr files can contain empty directories. I would assume it's not a thing, but perhaps we should appeal to someone with more knowledge than I. S3 has no concept of "directories", so they can't be empty. My vote is that the checksum calculator should just ignore empty directories and the empty directory will not be considered a part of the zarr.
@yarikoptic In order to show only a single entry in pyout when downloading a Zarr, I think the code would need to call
Can
Attn @satra on the above #852 (comment) question about empty directories. Smells like we should ignore them and exclude them from the checksum computation.
i would agree with that. the only thing is that we should perhaps test what an empty array (https://zarr.readthedocs.io/en/stable/api/creation.html#zarr.creation.empty) looks like in a nested directory store. and whether zarr's checksum of a zarr dataset changes or not.
That's the basic plan, but the problem is determining the size of the thread pool. Normally, on a quad-core computer, pyout only runs 8 threads at once, so there are only at most 8 concurrent downloads. If we set the size of the Zarr thread pools to some n, we could end up with 8n concurrent downloads — and if we try to avoid that by only downloading one file in a Zarr at a time, we may end up taking longer than normal to download everything if there's ever a point where fewer than 8 downloads are running in total.
@jwodder - for downloads and uploads isn't the bottleneck in disk and network bandwidth much more than the number of cores? could something be done to auto-adjust the number of processes? also when dealing with lee's data i found pyout too constraining by default, especially because the number of jobs it runs is limited by the display of the table. it would be nice if there was an option to not care about a tabular display but just a summary display. it does seem that decoupling download processes from display constraints would be a good step, and may provide a greater ability to optimize it.
@satra It is indeed possible to configure the maximum number of workers pyout uses when instantiating a table. How exactly do you recommend we determine this value?
@satra Also, regarding the number of jobs in pyout being limited by the display of the table: It seems that, by default, pyout waits for the top three jobs to complete before adding a new row that would cause the topmost to go off-screen, though this can be configured or disabled.
@jwodder - i don't have a good answer on how to optimize, but perhaps a test download would tell you how many threads optimize bandwidth given disk constraints (keep increasing the number of threads/processes till it saturates). on clusters where disks may be on different interfaces, they could even be competing with download bandwidth. it may be worthwhile looking up if someone has implemented an algorithm for this. regarding pyout, the issue i ran into was when one of the rows near the top was uploading (say a large file) and everything else was complete, it would not add any more jobs. if the bandwidth was being fully utilized i don't care too much, but in some cases this resulted in a less efficient upload. the other use case is when a lot of small files are being uploaded: then the job limit was 10 and could not be increased even though bandwidth wasn't an issue and devices weren't being saturated. hence the notion of removing the tabular display altogether and simply providing the summary display. between optimizing the number of threads and removing the tabular display, i would prioritize the latter first so that the number of uploads/downloads does not have to be limited to 10.
re jobs -- I think we should adopt what we have for
so the user could explicitly control how many threads per file (where possible, as in the case of zarr) it would use. Default to e.g. 4 and be "done". In the longer run, we are still to RF all the interfacing, and pyout should then not take care of parallelizing, only of reporting/UI.
@yarikoptic So you're suggesting adding a
yes, that is what I am thinking about. In principle we could even thread downloads of regular files via multiple range requests, but I do not see much need for that, so we could just comment that
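If the option ends up taking a jobs spec like N[:M] (that syntax is an assumption here; the default of 4 threads per file follows the suggestion above, and the helper name is made up), parsing it is trivial:

```python
def parse_jobs(spec: str, default_per_file: int = 4) -> tuple:
    """Parse "N" or "N:M": N = how many files/assets at once, M = threads per zarr/file."""
    if ":" in spec:
        n, m = spec.split(":", 1)
        return int(n), int(m)
    return int(spec), default_per_file

# e.g. a jobs value of "10:4" -> 10 assets in parallel, 4 download threads inside each zarr
assert parse_jobs("10:4") == (10, 4)
assert parse_jobs("6") == (6, 4)
```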
@yarikoptic I've written a library for running multiple iterators in threads and collecting the results. Now comes the bikeshedding part: deciding what status dicts should be emitted for a combined set of status dicts. A refresher: The columns shown in the download pyout table are "path", "size", "done", "done%", "checksum", "status", and "message". The status dicts yielded for a given path while it's downloading take the following forms, yielded roughly in the order shown:
Hence the following questions & thoughts:
Let's make it "X done, Y errored, Z skipped" with any of those omitted whenever it is just 0.
I would keep it "downloading" until we are done with all files, and then if some failed, yield "error" instead of "done".
my vote would be for "removed" so the user would get an adequate "done%" for how much of the zarr was downloaded.
we will be setting it for every file, right? then indeed not worth yielding, unless maybe for the .zarr/ directory at the end of the process.
"size" should report total zarr (as reported by the server); "done" and "done%" - seen so far.
I think the checksum should be reported once for the entire zarr at the end of the download process.
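A sketch of how those answers could translate into the single combined status dict fed to pyout (the column names follow the table described above; the aggregation details are illustrative):

```python
def combine_statuses(file_stats: list, total_size: int) -> dict:
    """Fold per-file download states into one pyout row for the whole zarr."""
    done = sum(1 for s in file_stats if s.get("status") == "done")
    errored = sum(1 for s in file_stats if s.get("status") == "error")
    skipped = sum(1 for s in file_stats if s.get("status") == "skipped")
    counts = [(done, "done"), (errored, "errored"), (skipped, "skipped")]
    message = ", ".join(f"{n} {label}" for n, label in counts if n)  # omit zero counts
    downloaded = sum(s.get("done", 0) for s in file_stats)
    finished = all(s.get("status") in ("done", "error", "skipped") for s in file_stats)
    return {
        "size": total_size,  # total for the zarr as reported by the server
        "done": downloaded,
        "done%": round(100 * downloaded / total_size, 2) if total_size else 0,
        "status": ("error" if errored else "done") if finished else "downloading",
        "message": message,
    }
```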
@yarikoptic I don't think the code should calculate a checksum for the whole Zarr after downloading; digests are already checked for the individual files, and that should be enough. Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.
I think it is actually important enough to guarantee consistency of the zarr as downloaded to match what is on the server, to avoid confusion and e.g. reupload of a zarr with some garbage in it. So, as the last step in the zarr download, it would be good to delete files which are not listed/present in the remote (on server) copy. And then the checksum should match. I guess we can rely on fscacher to cache individual files' checksums to make it reasonably fast.
@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.
so for files downloaded in the current session we would have md5s from "on the fly" digestion, right? for the others found in the folder we would need to re-digest them from disk. And that is where I thought fscacher would kick in. Or where was I wrong?
The digests aren't returned from the download manager; they're just discarded after they're checked.
that's unfortunate. Didn't check the code, but maybe you see a reasonably easy way to still make it possible to get their digests for the overall zarr digest computation "on the fly"? if not -- we will be doomed to re-digest the entire zarr tree in full upon completion.
@yarikoptic I figured out a way.
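Putting the pruning and the on-the-fly digests together, the final consistency step could look roughly like this (zarr_tree_digest and single_part_etag are the illustrative helpers sketched earlier, not the real implementations; the remote listing and checksum are assumed to come from the API):

```python
from pathlib import Path

def finalize_zarr_download(zarrdir: Path, remote_paths: set, remote_digest: str,
                           local_md5s: dict) -> None:
    """Prune files not present on the server, then compare the aggregate digest."""
    for p in list(zarrdir.rglob("*")):
        if p.is_file() and p.relative_to(zarrdir).as_posix() not in remote_paths:
            p.unlink()  # delete local garbage so the trees match
    for path in remote_paths - local_md5s.keys():
        # files that were already present locally and not re-downloaded need digesting
        local_md5s[path] = single_part_etag(zarrdir / path)
    if zarr_tree_digest(local_md5s) != remote_digest:
        raise RuntimeError(f"zarr checksum mismatch for {zarrdir}")
```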
i will copy the directory over and retry, but perhaps symbolic links should be supported.
.zarr extension is a zarr archive ... .zarray or .zgroup ... but I wonder if that would be reliable enough?