Minimize/optimize Zarr digestion when uploading #923
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #923      +/-  ##
==========================================
- Coverage   87.41%   87.32%    -0.09%
==========================================
  Files          61       62        +1
  Lines        7454     7662      +208
==========================================
+ Hits         6516     6691      +175
- Misses        938      971       +33
==========================================
Also, use the sum of the uploaded file sizes (instead of the Zarr size) as the upload percentage denominator
Also, do most of the digestion in threads
Some initial questions/comments:
to_delete: List[RemoteZarrEntry] = []
digesting: List[Future[Optional[Tuple[LocalZarrEntry, str]]]] = []
yield {"status": "comparing against remote Zarr"}
with ThreadPoolExecutor(max_workers=jobs or 5) as executor:
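For context, here is a minimal, self-contained sketch of the thread-pool digestion pattern these quoted lines use; the `md5_digest` helper and plain `Path` entries below are illustrative stand-ins, not the actual dandi-cli `LocalZarrEntry` API:

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed
from hashlib import md5
from pathlib import Path
from typing import Iterator, List, Tuple


def md5_digest(path: Path) -> str:
    # Illustrative helper: stream the file through MD5 in 1 MiB chunks.
    h = md5()
    with path.open("rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_entries(paths: List[Path], jobs: int = 5) -> Iterator[Tuple[Path, str]]:
    # Submit one digest job per entry and yield (path, digest) pairs as the
    # futures complete, so the caller can compare against the remote Zarr
    # while the remaining files are still being checksummed.
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        futures: List[Future[Tuple[Path, str]]] = [
            executor.submit(lambda p=p: (p, md5_digest(p))) for p in paths
        ]
        for fut in as_completed(futures):
            yield fut.result()
```

Yielding digests as futures complete lets the comparison against the remote Zarr overlap with checksumming instead of blocking on the slowest files.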
Did you try this PR on some sizeable Zarr upload, e.g. from the Hub to our staging S3? I wonder how long it took to start uploading for a fresh new Zarr (so nothing on the server yet) and for e.g. an interrupted upload after some files were already transmitted, and what speed you observed.
Not yet. Do you have a Zarr, or a script for creating one, that you'd recommend using, or should I just jump straight to trying to upload test128.ngff? (CC: @satra)
I have also downloaded that sample large-ish (~37GB) Zarr that @satra uploaded to staging into the hub: /shared/zarr/b35681cd-936a-477c-ac51-06661612d4df. It would be interesting to gauge the upload speed for that one into staging (as a new Zarr, of course), and then interrupt and continue. @satra also placed a helper tool (like dstat) under /shared/io-utils/dool to monitor "wire bandwidth" (in addition to what the client would be reporting).
@yarikoptic While testing out the upload, I found an inefficiency: the first thing upload() does when processing an asset is get its local size, which for Zarrs means iterating over the entire directory tree, which is later iterated over again inside iter_upload(). Should Zarr uploads delay this calculation until the latter point?
@yarikoptic I just finished a complete upload of the given Zarr in a single upload session, which took 63 minutes, the first couple of minutes of which were spent in pre-upload directory traversals. The upload itself alternated between uploading files in fast bursts of about 10 seconds each and waiting about 10-20 seconds for /zarr/{zarr_id}/upload/complete/ requests to return. Next I'll try interrupting and resuming.
If this is simply for the metadata, I would say the client could provide it, but there is no reason for the server to trust it. And since the server will compute the size and checksum out of band, the server will populate those fields in the metadata. This is similar to the PR that @AlmightyYakob is working on, which will also support out-of-band ingestion.
@satra This is not for the metadata; this is for the pyout display.
Ah, in that case I would say don't worry about the total size; just provide a summary at the end.
@yarikoptic By "delay", I meant leaving the display of the "SIZE" column for a Zarr blank until all the local entries have been traversed while determining what to upload, at which point (just before starting the upload proper) iter_upload() can just emit a {"size": total_size} dict, the size having been calculated while iterating.
That sounds great to me.
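A rough sketch of that approach, with a simplified generator and directory walk standing in for the real iter_upload() (the names below are illustrative only, not the client's actual code):

```python
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_upload_sketch(zarr_root: Path) -> Iterator[Dict[str, Any]]:
    # Walk the local Zarr once while deciding what to upload, accumulating
    # the total size along the way instead of computing it up front.
    yield {"status": "comparing against remote Zarr"}
    to_upload = []
    total_size = 0
    for path in zarr_root.rglob("*"):
        if path.is_file():
            to_upload.append(path)
            total_size += path.stat().st_size
    # Only now does the pyout "SIZE" column get a value:
    yield {"size": total_size}
    for path in to_upload:
        yield {"status": "uploading", "path": str(path.relative_to(zarr_root))}
    yield {"status": "done"}
```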
After using this branch to upload a 38GB Zarr from the Hub in just over an hour, I renamed the Zarr, started a new upload, let it run a little bit, interrupted it, and then started uploading again. That worked out fine (after a fix). I then tried another upload, without having changed anything, and it took the client almost exactly 30 minutes to compare the local & remote Zarrs and determine that nothing needed to be uploaded. One possible change that would likely make this faster would be if the Zarr directory listings returned from …
Filed dandi/dandi-archive#925. As a file-size change is likely to be a "rare" indicator of a change within a file, I wonder if we should stick to it, or just proceed immediately to checksumming, which has to happen anyway for the overall check of whether there is anything to upload whenever everything is exactly the same (that took you 30 minutes), right? Meanwhile, I will just proceed with merging this one as-is to ease testing etc.
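To make the trade-off concrete, here is a hedged sketch of the two comparison strategies being discussed, using a hypothetical needs_upload() helper rather than the client's actual LocalZarrEntry/RemoteZarrEntry comparison:

```python
from hashlib import md5
from pathlib import Path


def needs_upload(local: Path, remote_size: int, remote_md5: str,
                 size_first: bool = True) -> bool:
    # Strategy A (size_first=True): treat a size mismatch as a cheap early
    # "changed" signal and only checksum when the sizes agree.
    # Strategy B (size_first=False): go straight to checksumming, since a
    # full checksum pass is needed anyway to conclude that nothing changed.
    if size_first and local.stat().st_size != remote_size:
        return True
    h = md5()
    with local.open("rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() != remote_md5
```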
Closes #913.
Closes #914.
Closes #915.
Closes #926.