Minimize/optimize Zarr digestion when uploading #923
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #923      +/-  ##
==========================================
- Coverage   87.41%   87.32%    -0.09%
==========================================
  Files          61       62        +1
  Lines        7454     7662      +208
==========================================
+ Hits         6516     6691      +175
- Misses        938      971       +33
==========================================
Also, use the sum of the uploaded file sizes (instead of the Zarr size) as the upload percentage denominator
Also, do most of the digestion in threads
Some initial questions/comments:
to_delete: List[RemoteZarrEntry] = []
digesting: List[Future[Optional[Tuple[LocalZarrEntry, str]]]] = []
yield {"status": "comparing against remote Zarr"}
with ThreadPoolExecutor(max_workers=jobs or 5) as executor:
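For context, here is a minimal, self-contained sketch of the thread-pool digestion pattern these quoted lines use; the `md5_digest` helper and plain `Path` entries below are illustrative stand-ins, not the actual dandi-cli `LocalZarrEntry` API:

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed
from hashlib import md5
from pathlib import Path
from typing import Iterator, List, Tuple


def md5_digest(path: Path) -> str:
    # Illustrative helper: stream the file through MD5 in 1 MiB chunks.
    h = md5()
    with path.open("rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_entries(paths: List[Path], jobs: int = 5) -> Iterator[Tuple[Path, str]]:
    # Submit one digest job per entry and yield (path, digest) pairs as the
    # futures complete, so the caller can compare against the remote Zarr
    # while the remaining files are still being checksummed.
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        futures: List[Future[Tuple[Path, str]]] = [
            executor.submit(lambda p=p: (p, md5_digest(p))) for p in paths
        ]
        for fut in as_completed(futures):
            yield fut.result()
```

Yielding digests as futures complete lets the comparison against the remote Zarr overlap with checksumming instead of blocking on the slowest files.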
Did you try this PR on some sizeable Zarr upload, e.g. from the Hub to our staging S3? I wonder how long it took to start uploading for a fresh new Zarr (so nothing on the server yet) and for e.g. an interrupted upload after some files were already transmitted, and what speed you observed.
Not yet. Do you have a Zarr, or a script for creating one, that you'd recommend using, or should I just jump straight to trying to upload test128.ngff? (CC: @satra)
I have also downloaded that sample large-ish (~37GB) Zarr that @satra uploaded to staging into the hub: /shared/zarr/b35681cd-936a-477c-ac51-06661612d4df. It would be interesting to gauge the upload speed for that one into staging (as a new Zarr, of course), and then interrupt and continue. @satra also placed a helper tool (like dstat) under /shared/io-utils/dool to monitor "wire bandwidth" (in addition to what the client would be reporting).
@yarikoptic While testing out the upload, I found an inefficiency: the first thing upload() does when processing an asset is get its local size, which for Zarrs means iterating over the entire directory tree, which is later iterated over again inside iter_upload(). Should Zarr uploads delay this calculation until the latter point?
@yarikoptic I just finished a complete upload of the given Zarr in a single upload session, which took 63 minutes, the first couple of minutes of which were spent in pre-upload directory traversals. The upload itself alternated between uploading files in fast bursts of about 10 seconds each and waiting about 10-20 seconds for /zarr/{zarr_id}/upload/complete/ requests to return. Next I'll try interrupting and resuming.
If this is simply for the metadata, I would say the client could provide it, but there is no reason for the server to trust it. And since the server will compute the size and checksum out of band, the server will populate those fields in the metadata. This is similar to the PR that @AlmightyYakob is working on, which will also support out-of-band ingestion.
@satra This is not for the metadata; this is for the pyout display.
Ah, in that case I would say don't worry about the total size; just provide a summary at the end.
@yarikoptic By "delay", I meant leaving the display of the "SIZE" column for a Zarr blank until all the local entries have been traversed while determining what to upload, at which point (just before starting the upload proper) iter_upload() can just emit a {"size": total_size} dict, the size having been calculated while iterating.
That sounds great to me.
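A rough sketch of that approach, with a simplified generator and directory walk standing in for the real iter_upload() (the names below are illustrative only, not the client's actual code):

```python
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_upload_sketch(zarr_root: Path) -> Iterator[Dict[str, Any]]:
    # Walk the local Zarr once while deciding what to upload, accumulating
    # the total size along the way instead of computing it up front.
    yield {"status": "comparing against remote Zarr"}
    to_upload = []
    total_size = 0
    for path in zarr_root.rglob("*"):
        if path.is_file():
            to_upload.append(path)
            total_size += path.stat().st_size
    # Only now does the pyout "SIZE" column get a value:
    yield {"size": total_size}
    for path in to_upload:
        yield {"status": "uploading", "path": str(path.relative_to(zarr_root))}
    yield {"status": "done"}
```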
After using this branch to upload a 38GB Zarr from the Hub in just over an hour, I renamed the Zarr, started a new upload, let it run a little bit, interrupted it, and then started uploading again. That worked out fine (after a fix). I then tried another upload, without having changed anything, and it took the client almost exactly 30 minutes to compare the local & remote Zarrs and determine that nothing needed to be uploaded. One possible change that would likely make this faster would be if the Zarr directory listings returned from …
Filed dandi/dandi-archive#925. As a file-size change is likely to be a "rare" indicator of a change within a file, I wonder if we should stick to it, or just proceed immediately to checksumming, which has to happen anyway for the overall check of whether there is anything to upload whenever everything is exactly the same (that took you 30 minutes), right? Meanwhile, I will just proceed with merging this one as-is to ease testing etc.
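To make the trade-off concrete, here is a hedged sketch of the two comparison strategies being discussed, using a hypothetical needs_upload() helper rather than the client's actual LocalZarrEntry/RemoteZarrEntry comparison:

```python
from hashlib import md5
from pathlib import Path


def needs_upload(local: Path, remote_size: int, remote_md5: str,
                 size_first: bool = True) -> bool:
    # Strategy A (size_first=True): treat a size mismatch as a cheap early
    # "changed" signal and only checksum when the sizes agree.
    # Strategy B (size_first=False): go straight to checksumming, since a
    # full checksum pass is needed anyway to conclude that nothing changed.
    if size_first and local.stat().st_size != remote_size:
        return True
    h = md5()
    with local.open("rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() != remote_md5
```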
Closes #913.
Closes #914.
Closes #915.
Closes #926.