
Parallelize removal of extra files in Zarr #1411

Open
yarikoptic opened this issue Feb 23, 2024 · 5 comments
Labels: performance (Improve performance of an existing feature), UX, zarr

@yarikoptic (Member)

related: #1410

yarikoptic added the UX and performance labels on Feb 23, 2024
@jwodder (Member) commented Feb 26, 2024

The client already batches Zarr entry deletions, 100 entries per request. I doubt doing multiple batches in parallel is going to result in faster turnaround from the server.
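(For illustration, a minimal sketch of what that client-side batching could look like; the endpoint path matches the DELETE URLs in the log further down, but the payload shape, function name, and parameters here are assumptions, not dandi-cli's actual code.)

```python
# Minimal sketch of batched Zarr entry deletion on the client side (illustrative
# only, not dandi-cli's actual code). The payload shape and names are assumptions.
import requests

BATCH_SIZE = 100  # entries deleted per request, matching the current client behaviour

def delete_zarr_entries(api_url: str, zarr_id: str, paths: list[str], token: str) -> None:
    """Delete remote Zarr entries in batches of BATCH_SIZE, one request per batch."""
    session = requests.Session()
    session.headers["Authorization"] = f"token {token}"
    for i in range(0, len(paths), BATCH_SIZE):
        batch = paths[i : i + BATCH_SIZE]
        resp = session.delete(
            f"{api_url}/zarr/{zarr_id}/files/",
            json=[{"path": p} for p in batch],  # assumed payload shape
        )
        resp.raise_for_status()
```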

@yarikoptic (Member, Author)

Why? Doesn't the server handle requests in parallel?

@jwodder (Member) commented Feb 26, 2024

@yarikoptic I can't find a record of why the exact value of 100 was chosen, but I believe the point of the limit is to avoid making the server do too much work on a Zarr at once. Sending batches simultaneously would therefore just mean too much work for the server.

@jjnesbitt @mvandenburgh et alii: Can you confirm or deny that there's no efficiency gain to be had from parallelizing batched Zarr entry deletion requests?

@jjnesbitt (Member)

> @jjnesbitt @mvandenburgh et alii: Can you confirm or deny that there's no efficiency gain to be had from parallelizing batched Zarr entry deletion requests?

That's correct, there would be no efficiency gain. This is for two reasons:

  1. The zarr is locked for update while the request is processed, so if two requests were made in parallel, they would be processed sequentially (see the sketch after this list).
  2. I believe the underlying reason we limit the request size is that the API makes requests to the S3 API under the hood in order to delete the requested zarr's files. Allowing parallel invocation of this could cause transient errors from S3, which would significantly complicate things.
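(A rough sketch of the serialization described in point 1, assuming a per-zarr lock held around the S3 DeleteObjects call; the lock, function, and parameter names are illustrative, not the actual dandi-archive implementation.)

```python
# Rough sketch of the serialization described in point 1 (illustrative only,
# not the dandi-archive implementation): a per-zarr lock is held while the
# deletions are forwarded to S3, so concurrent requests for the same zarr wait.
import threading
import boto3

_zarr_locks: dict[str, threading.Lock] = {}  # stand-in for the DB row lock

def delete_zarr_files(bucket: str, zarr_id: str, keys: list[str]) -> None:
    lock = _zarr_locks.setdefault(zarr_id, threading.Lock())
    with lock:  # parallel requests for the same zarr serialize here
        s3 = boto3.client("s3")
        # S3's DeleteObjects call accepts at most 1000 keys per request
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in keys]},
        )
```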

Is the current performance of the zarr deletion endpoint causing problems elsewhere?

@yarikoptic (Member, Author)

> Is the current performance of the zarr deletion endpoint causing problems elsewhere?

Somewhat. From the log of #1410, the issue referenced in the OP:

❯ zgrep -e 'Deleting.*files' -e 'DELETE.*zarr' 20240223192140Z-306046.log.gz | head -n 2
2024-02-23T14:36:59-0500 [DEBUG   ] dandi 306046:140253474395840 sub-randomzarrlike/sub-randomzarrlike_junk.zarr: Deleting 226053 files in remote Zarr not present locally
2024-02-23T14:36:59-0500 [DEBUG   ] dandi 306046:140253474395840 DELETE https://api.dandiarchive.org/api/zarr/fd6ab3ea-cff6-4006-a9bf-acfa5d983985/files/
❯ zgrep 'DELETE.*zarr.*files/$' 20240223192140Z-306046.log.gz | tail -n 1
2024-02-23T18:14:40-0500 [DEBUG   ] dandi 306046:140253474395840 DELETE https://api.dandiarchive.org/api/zarr/fd6ab3ea-cff6-4006-a9bf-acfa5d983985/files/

So I believe it took over 3 hours "merely" to delete (a lot of) files in the Zarr after the upload of the other files had finished. Such a drastic action was needed here because I changed the "chunking" strategy for the zarr, so this situation would not be completely uncommon. I thought it might be nice to make it speedier.
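(For scale, assuming the batches were issued back to back: 226053 entries at 100 per request is roughly 2260 DELETE calls, and the ~3 hours 38 minutes between the first and last DELETE in the log works out to roughly 5–6 seconds per batch.)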
