Minimize/optimize Zarr digestion when uploading #923

Merged: 14 commits merged into master on Feb 25, 2022

Conversation

@jwodder (Member) commented Feb 22, 2022

Closes #913.
Closes #914.
Closes #915.
Closes #926.

@jwodder added the "performance" label (Improve performance of an existing feature) on Feb 22, 2022
@codecov codecov bot commented Feb 22, 2022

Codecov Report

Merging #923 (548b7b3) into master (e8c71aa) will decrease coverage by 0.08%.
The diff coverage is 90.73%.


@@            Coverage Diff             @@
##           master     #923      +/-   ##
==========================================
- Coverage   87.41%   87.32%   -0.09%     
==========================================
  Files          61       62       +1     
  Lines        7454     7662     +208     
==========================================
+ Hits         6516     6691     +175     
- Misses        938      971      +33     
Flag: unittests, Coverage Δ: 87.32% <90.73%> (-0.09%) ⬇️


Impacted Files Coverage Δ
dandi/upload.py 85.85% <85.71%> (+0.67%) ⬆️
dandi/dandiapi.py 89.01% <87.50%> (+0.10%) ⬆️
dandi/files.py 79.37% <88.88%> (-3.56%) ⬇️
dandi/support/threaded_walk.py 92.85% <92.85%> (ø)
dandi/support/digests.py 100.00% <100.00%> (ø)
dandi/support/tests/test_digests.py 100.00% <100.00%> (ø)
dandi/tests/test_dandiapi.py 100.00% <100.00%> (ø)
dandi/tests/test_upload.py 100.00% <100.00%> (ø)
dandi/misctypes.py 61.68% <0.00%> (-0.94%) ⬇️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Also, use the sum of the uploaded file sizes (instead of the Zarr size) as the upload percentage denominator
@jwodder changed the title from "Minimize Zarr digestion" to "Minimize/optimize Zarr digestion" on Feb 22, 2022
@jwodder changed the title from "Minimize/optimize Zarr digestion" to "Minimize/optimize Zarr digestion when uploading" on Feb 23, 2022
@jwodder marked this pull request as ready for review on February 23, 2022 at 14:42
@yarikoptic (Member) left a comment:

A few initial questions/comments:

dandi/support/digests.py (two resolved review threads)
to_delete: List[RemoteZarrEntry] = []  # remote entries scheduled for deletion
digesting: List[Future[Optional[Tuple[LocalZarrEntry, str]]]] = []  # futures yielding (local entry, digest)
yield {"status": "comparing against remote Zarr"}
with ThreadPoolExecutor(max_workers=jobs or 5) as executor:
Member:

Did you try this PR on some sizeable Zarr upload, e.g. from the Hub to our staging S3? I wonder how long it took to start uploading for a fresh new Zarr (so nothing on the server yet) and for, e.g., an interrupted upload after some files were already transmitted, and what speed you observed.

@jwodder (Member, Author):

Not yet. Do you have a Zarr or script for creating one to recommend using, or should I just jump straight to trying to upload test128.ngff? (CC: @satra)

Member:

I have also downloaded that sample large-ish (~37GB) Zarr @satra uploaded to staging into the hub: /shared/zarr/b35681cd-936a-477c-ac51-06661612d4df. It would be interesting to gauge upload speed for that one into staging (as a new Zarr, of course), and then interrupt and continue. @satra also placed a helper tool (like dstat) under /shared/io-utils/dool to monitor "wire bandwidth" (in addition to what the client reports).

@jwodder (Member, Author):

@yarikoptic While testing out the upload, I found an inefficiency: the first thing upload() does when processing an asset is to get its local size, which for Zarrs means iterating over the entire directory tree, which is later iterated over again inside iter_upload(). Should Zarr uploads delay this calculation until the latter point?
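
For illustration only, a minimal sketch of the deferral being proposed; the helper names (local_size, process_asset, is_zarr) are hypothetical, not the actual dandi-cli API:

```python
import os
from pathlib import Path


def local_size(path: Path) -> int:
    # One full walk of the directory tree just to sum file sizes; for a large
    # Zarr this is the work that would otherwise be repeated inside iter_upload().
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(path)
        for name in files
    )


def process_asset(path: Path, is_zarr: bool) -> None:
    # Hypothetical sketch: ordinary files report their size up front, while
    # Zarr assets skip the extra walk here and let iter_upload() report
    # {"size": total} once it has traversed the entries anyway.
    if not is_zarr:
        print({"size": os.path.getsize(path)})
    # ... hand off to iter_upload(), which can yield a size dict later ...
```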

@jwodder (Member, Author):

@yarikoptic I just finished a complete upload of the given Zarr in a single upload session, which took 63 minutes, the first couple minutes of which were spent in pre-upload directory traversals. The upload itself alternated between uploading files in fast bursts of about 10 seconds each and waiting about 10-20 seconds for /zarr/{zarr_id}/upload/complete/ requests to return. Next I'll try interrupting and resuming.

Member:

If this is simply for the metadata, I would say the client could provide it, but there is no reason for the server to trust it. And since the server will compute size and checksum out of band, the server will populate those fields in the metadata. This is similar to the PR that @AlmightyYakob is working on, which will also support out-of-band ingestion.

@jwodder (Member, Author):

@satra This is not for the metadata; this is for the pyout display.

Member:

Ah, in that case I would say don't worry about total size; just provide a summary at the end.

Member:

> @yarikoptic By "delay", I meant leaving the display of the "SIZE" column for a Zarr blank until all the local entries have been traversed while determining what to upload, at which point (just before starting upload proper) iter_upload() can just emit a {"size": total_size} dict, the size having been calculated while iterating.

That sounds great to me.
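
For illustration, a rough sketch of the generator shape described in the quoted comment; this is a simplification under assumed names, not the actual iter_upload() implementation:

```python
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_upload_sketch(zarr_root: Path) -> Iterator[Dict[str, Any]]:
    # Traverse the local entries once, deciding what to upload and summing sizes.
    to_upload = []
    total_size = 0
    for p in zarr_root.rglob("*"):
        if p.is_file():
            total_size += p.stat().st_size
            to_upload.append(p)  # the real code also compares against the remote Zarr
    # Only now, just before the upload proper, report the size for the pyout display.
    yield {"size": total_size}
    for p in to_upload:
        # ... upload the entry here ...
        yield {"status": "uploading", "path": str(p)}
```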

Member:

@satra - don't worry about size. But we need to figure out what to do with /complete for finishing a batch upload for a Zarr, which seems to add ~200% overhead on top of transfer time in @jwodder's experiment.

@jwodder (Member, Author) commented Feb 24, 2022

After using this branch to upload a 38GB Zarr from the Hub in just over an hour, I renamed the Zarr, started a new upload, let it run a little bit, interrupted it, and then started uploading again. That worked out fine (after a fix). I then tried another upload, without having changed anything, and it took the client almost exactly 30 minutes to compare the local & remote Zarrs and determine that nothing needed to be uploaded.

One possible change that would likely make this faster would be if the Zarr directory listings returned from /zarr/{zarr_id}.zarr/{dirpath}/ requests included file sizes; they currently do not, which means the client has to make a HEAD request for every remote entry it wants to get the size of.
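
As a rough illustration of the per-entry cost described above (the listing path is the one named in the comment; the base URL and redirect handling are simplifying assumptions, not the exact archive API):

```python
from typing import Dict, List

import requests

API_URL = "https://api-staging.dandiarchive.org/api"  # assumed staging base URL


def remote_entry_sizes(zarr_id: str, dirpath: str, entries: List[str]) -> Dict[str, int]:
    # The listing at /zarr/{zarr_id}.zarr/{dirpath}/ does not include sizes, so
    # the client issues one HEAD request per entry just to read Content-Length,
    # i.e. O(number of entries) extra round trips.
    sizes: Dict[str, int] = {}
    for name in entries:
        r = requests.head(
            f"{API_URL}/zarr/{zarr_id}.zarr/{dirpath}/{name}", allow_redirects=True
        )
        r.raise_for_status()
        sizes[name] = int(r.headers["Content-Length"])
    return sizes
```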

@yarikoptic (Member):

> One possible change that would likely make this faster would be if the Zarr directory listings returned from /zarr/{zarr_id}.zarr/{dirpath}/ requests included file sizes; they currently do not, which means the client has to make a HEAD request for every remote entry it wants to get the size of.

Filed dandi/dandi-archive#925. Since a file-size change is likely to be a rare indicator of a change within a file, I wonder whether we should stick with it or just proceed immediately to checksumming, which has to happen anyway for the overall check of whether there is anything to upload when everything is exactly the same (the case that took you 30 minutes), right?

Meanwhile, I will just proceed with the merge of this one as is to ease testing etc.
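
For context on the "just checksum everything" option, a minimal sketch assuming per-entry MD5 digests (which is what the digesting futures in the diff excerpt above appear to produce); the remote-digest lookup is left abstract and the names are hypothetical:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_digest(path: Path) -> str:
    # Stream the file in 1 MiB chunks to avoid holding large entries in memory.
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def entries_to_upload(zarr_root: Path, remote_digests: Dict[str, str]) -> List[Path]:
    # Skip the size comparison entirely: an entry is (re)uploaded only if its
    # local digest is missing from, or differs from, the remote record.
    changed = []
    for p in sorted(zarr_root.rglob("*")):
        if p.is_file():
            rel = p.relative_to(zarr_root).as_posix()
            if remote_digests.get(rel) != md5_digest(p):
                changed.append(p)
    return changed
```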

@yarikoptic merged commit a1a7fd5 into master on Feb 25, 2022
Labels: cmd-upload, performance (Improve performance of an existing feature), zarr
Projects: None yet
3 participants