Parallelize removal of extra files in Zarr #1411
The client already batches Zarr entry deletions, 100 entries per request. I doubt doing multiple batches in parallel is going to result in faster turnaround from the server.
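For context, a minimal sketch of what serial batched deletion looks like. This is not the actual dandi-cli code; the helper name, authentication scheme, and request payload shape are assumptions for illustration:

```python
import requests

API_URL = "https://api.dandiarchive.org/api"
BATCH_SIZE = 100  # entries per DELETE request, per the current client behavior


def delete_zarr_entries(zarr_id: str, paths: list[str], token: str) -> None:
    """Delete remote Zarr entries in batches of BATCH_SIZE, one request at a time.

    Hypothetical sketch: the payload shape below is an assumption, not the
    documented API contract.
    """
    session = requests.Session()
    session.headers["Authorization"] = f"token {token}"
    for i in range(0, len(paths), BATCH_SIZE):
        batch = paths[i : i + BATCH_SIZE]
        r = session.delete(
            f"{API_URL}/zarr/{zarr_id}/files/",
            json=[{"path": p} for p in batch],
        )
        r.raise_for_status()
```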
Why? Doesn't it handle requests in parallel?
@yarikoptic I can't find why the exact value of 100 was chosen, but I believe the point of the limit is to avoid making the server do too much work on a Zarr at once. Simultaneous requests would therefore mean too much work for the server. @jjnesbitt @mvandenburgh et al.: Can you confirm or deny that there's no efficiency gain to be had from parallelizing batched Zarr entry deletion requests?
That's correct; there would be no efficiency gain. This is for two reasons:
Is the current performance of the zarr deletion endpoint causing problems elsewhere?
Somewhat. From the log of the issue referenced in the OP (#1410):

```
❯ zgrep -e 'Deleting.*files' -e 'DELETE.*zarr' 20240223192140Z-306046.log.gz | head -n 2
2024-02-23T14:36:59-0500 [DEBUG ] dandi 306046:140253474395840 sub-randomzarrlike/sub-randomzarrlike_junk.zarr: Deleting 226053 files in remote Zarr not present locally
2024-02-23T14:36:59-0500 [DEBUG ] dandi 306046:140253474395840 DELETE https://api.dandiarchive.org/api/zarr/fd6ab3ea-cff6-4006-a9bf-acfa5d983985/files/
❯ zgrep 'DELETE.*zarr.*files/$' 20240223192140Z-306046.log.gz | tail -n 1
2024-02-23T18:14:40-0500 [DEBUG ] dandi 306046:140253474395840 DELETE https://api.dandiarchive.org/api/zarr/fd6ab3ea-cff6-4006-a9bf-acfa5d983985/files/
```

So I believe it took over 3 hours merely to delete (lots of) files in the Zarr after the upload of the other files had finished. Such a drastic action was needed here because I changed the chunking strategy for the Zarr, so this would not be completely uncommon. So I thought it might be nice to make it speedier.
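Back of the envelope, from the figures in that log: 226053 entries at 100 per request means roughly 2261 serial DELETE calls, and the span from 14:36:59 to 18:14:40 is about 3 hours 38 minutes (~13000 s), i.e. on the order of 5-6 seconds per batch. If the server can genuinely process batches concurrently, even a handful of parallel workers would cut that wall-clock time several-fold.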
Related: #1410 describes the use-case. I think that removal is going very slowly, primarily because we do it serially over groups of keys. Couldn't we parallelize (using the same jobs setting) and issue a bunch of requests (with retries if needed) to that API DELETE endpoint?
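For illustration only, a minimal sketch of what the proposed parallelization could look like, using a thread pool with per-batch retries. The endpoint payload shape, retry policy, and function names are assumptions, not the dandi-cli implementation:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

API_URL = "https://api.dandiarchive.org/api"
BATCH_SIZE = 100  # entries per DELETE request, matching current client behavior
RETRIES = 3  # per-batch retry count; illustrative policy


def delete_batch(session: requests.Session, zarr_id: str, batch: list[str]) -> None:
    """Issue one DELETE request for up to BATCH_SIZE paths, retrying on failure."""
    for attempt in range(RETRIES):
        try:
            r = session.delete(
                f"{API_URL}/zarr/{zarr_id}/files/",
                json=[{"path": p} for p in batch],  # assumed payload shape
            )
            r.raise_for_status()
            return
        except requests.RequestException:
            if attempt == RETRIES - 1:
                raise


def delete_zarr_entries_parallel(
    zarr_id: str, paths: list[str], token: str, jobs: int = 5
) -> None:
    """Delete remote Zarr entries with up to `jobs` batches in flight at once."""
    # Note: sharing a Session across threads is common practice but not
    # formally guaranteed thread-safe; a per-worker session would be stricter.
    session = requests.Session()
    session.headers["Authorization"] = f"token {token}"
    batches = [paths[i : i + BATCH_SIZE] for i in range(0, len(paths), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(delete_batch, session, zarr_id, b) for b in batches]
        for fut in as_completed(futures):
            fut.result()  # re-raise the first failure, if any
```

Whether this actually helps depends on the server-side handling discussed above; if the endpoint serializes work per Zarr, the parallel requests would just queue.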