
AWS: Use executor service by default when performing batch deletion of files #5379

Merged 1 commit into apache:master on Aug 10, 2022

Conversation

Contributor

@amogh-jahagirdar commented Jul 29, 2022

Update the S3FileIO to lazily load batches and use the existing deletion thread pool to perform concurrent S3#RemoveObjects calls.

This will be used in subsequent PRs to perform bulk deletes in procedures such as removing orphan files and expiring snapshots, and to purge data files, manifests, and old metadata files during table drop.
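For background, a single batch removal with the AWS SDK v2 looks roughly like the following. This is a minimal sketch, not the PR's code; the `s3` client, `bucket`, and `keys` parameters are assumptions, and S3 caps each request at 1000 objects.

```java
import java.util.List;
import java.util.stream.Collectors;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.Delete;
import software.amazon.awssdk.services.s3.model.DeleteObjectsRequest;
import software.amazon.awssdk.services.s3.model.DeleteObjectsResponse;
import software.amazon.awssdk.services.s3.model.ObjectIdentifier;

// Deletes one batch of keys (at most 1000) from a single bucket.
static DeleteObjectsResponse deleteBatch(S3Client s3, String bucket, List<String> keys) {
  DeleteObjectsRequest request = DeleteObjectsRequest.builder()
      .bucket(bucket)
      .delete(Delete.builder()
          .objects(keys.stream()
              .map(key -> ObjectIdentifier.builder().key(key).build())
              .collect(Collectors.toList()))
          .build())
      .build();
  // The response reports per-object errors rather than throwing for partial failures.
  return s3.deleteObjects(request);
}
```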

@amogh-jahagirdar
Contributor Author

I'm running AWS integ tests to validate this.

@amogh-jahagirdar changed the title from "API, AWS: Add deleteFiles which accepts batch size to SupportsBulkOperations" to "API, AWS: Add deleteFiles which accepts a thread pool to be used during deletion" on Jul 29, 2022
@amogh-jahagirdar force-pushed the bulk-delete-with-size branch 3 times, most recently from d44f31e to 7183866, on July 29, 2022 01:37
@amogh-jahagirdar changed the title from "API, AWS: Add deleteFiles which accepts a thread pool to be used during deletion" to "API, AWS: Add deleteFiles which accepts an ExecutorService to be used during deletion" on Jul 29, 2022
@amogh-jahagirdar changed the title from "API, AWS: Add deleteFiles which accepts an ExecutorService to be used during deletion" to "AWS: Use executor service by default when performing batch deletion of files" on Jul 29, 2022
```diff
-    if (!awsProperties.isS3DeleteEnabled()) {
-      return;
+    if (awsProperties.isS3DeleteEnabled()) {
+      SetMultimap<String, String> bucketToObjects = computeBucketToObjects(paths);
```
Contributor Author

@amogh-jahagirdar commented Jul 30, 2022

We are now eagerly computing the bucket-to-objects mapping up front. Previously we would iterate over the paths, keep track of the objects per bucket, and whenever the objects tracked for a bucket reached the batch size, trigger the deletion and remove that bucket's entries from the mapping.

Since everything is now held up front, memory consumption increases, but I think it should be acceptable: 1 million objects with a maximum key size of 1024 bytes is about 1 GB of paths kept in memory.
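The eager grouping under discussion might look roughly like this sketch. The method name comes from the diff above, but the body and path parsing are illustrative assumptions, not the PR's code.

```java
import com.google.common.collect.HashMultimap;
import com.google.common.collect.SetMultimap;
import java.net.URI;

// Illustrative only: groups every path by bucket before any deletion starts,
// so all keys are held in memory at once.
static SetMultimap<String, String> computeBucketToObjects(Iterable<String> paths) {
  SetMultimap<String, String> bucketToObjects = HashMultimap.create();
  for (String path : paths) {
    URI uri = URI.create(path); // e.g. s3://bucket/key
    bucketToObjects.put(uri.getHost(), uri.getPath().substring(1));
  }
  return bucketToObjects;
}
```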

Collaborator

@szehon-ho commented Aug 2, 2022

1 GB does sound like a lot, for example on the Spark driver. Is it still possible to do it via streaming, i.e. submit the deletion batch to Tasks once it gets full?

Contributor

Can we try to use https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/paginators/ListObjectsV2Iterable.html so that the list is loaded dynamically instead of buffered up front?
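For reference, ListObjectsV2Iterable is the lazily paged result of the SDK v2 paginator. A small sketch of consuming it; the streamKeys method and its parameters are illustrative, not part of the PR.

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;
import software.amazon.awssdk.services.s3.paginators.ListObjectsV2Iterable;

static void streamKeys(S3Client s3, String bucket, String prefix) {
  ListObjectsV2Iterable pages = s3.listObjectsV2Paginator(
      ListObjectsV2Request.builder().bucket(bucket).prefix(prefix).build());

  // Pages are fetched from S3 as iteration advances, so keys are never buffered up front.
  for (S3Object object : pages.contents()) {
    String key = object.key();
    // feed key into the current deletion batch; flush when the batch is full
  }
}
```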

Contributor Author

@amogh-jahagirdar commented Aug 4, 2022

@jackye1995 I don't think we actually need a separate API for listing; we are already given the iterable of paths, and that iterable can be a ListObjectsV2Iterable (which is what deletePrefix does today).

To load lazily in memory while still deleting concurrently, we can do the following (a sketch of the pattern follows below):

1. Keep the previous approach: construct the bucket/key from each path and track when the objects for a bucket hit a certain batch size.

2. Instead of using the Tasks framework, submit the deletion to an executor service and just keep track of the future. We don't want to use Tasks because it waits for completion internally.

We want to just submit each batch deletion and move on, and then at the very end check the statuses of all those futures. Using the underlying executor service fits that pattern. Let me know what you think.

Will update the PR.
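A minimal sketch of that pattern, not the PR's actual code: bucket/key grouping happens as paths stream in, each full batch is handed to the executor, and the resulting futures are only checked at the end. The class and method names (BulkDeleteSketch, deleteBatch, batchSize) are assumptions for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class BulkDeleteSketch {

  void bulkDelete(Iterable<String> paths, ExecutorService executorService, int batchSize)
      throws InterruptedException, ExecutionException {
    Map<String, List<String>> pending = new HashMap<>();
    List<Future<?>> futures = new ArrayList<>();

    for (String path : paths) {
      URI uri = URI.create(path); // e.g. s3://bucket/key
      String bucket = uri.getHost();
      String key = uri.getPath().substring(1);
      List<String> keys = pending.computeIfAbsent(bucket, b -> new ArrayList<>());
      keys.add(key);

      if (keys.size() >= batchSize) {
        List<String> batch = new ArrayList<>(keys);
        keys.clear();
        // Submit the batch and keep going; completion is not awaited here.
        futures.add(executorService.submit(() -> deleteBatch(bucket, batch)));
      }
    }

    // Flush any partially filled batches.
    pending.forEach((bucket, keys) -> {
      if (!keys.isEmpty()) {
        futures.add(executorService.submit(() -> deleteBatch(bucket, new ArrayList<>(keys))));
      }
    });

    // Only at the very end wait for the submitted deletions and surface failures.
    for (Future<?> future : futures) {
      future.get();
    }
  }

  // Hypothetical helper: would issue one S3 DeleteObjects call for this bucket's batch.
  private void deleteBatch(String bucket, List<String> keys) {
    // build a DeleteObjectsRequest for (bucket, keys) and call s3.deleteObjects(...)
  }
}
```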

Contributor

Sounds good!

Collaborator

@szehon-ho left a comment

I think this is probably a valid change, but as of now it doesn't seem to match the PR description? The PR description is about using an executor service. Can you clarify the PR description?

@amogh-jahagirdar force-pushed the bulk-delete-with-size branch 2 times, most recently from 5be1059 to 639acf1, on August 2, 2022 02:06
Contributor Author

@amogh-jahagirdar commented Aug 2, 2022

> I think this is probably a valid change, but as of now it doesn't seem to match the PR description? The PR description is about using an executor service. Can you clarify the PR description?

Updated the description so it reflects the new approach. I also removed the integ test fixes and moved them to #5413. Thanks!

@amogh-jahagirdar force-pushed the bulk-delete-with-size branch 4 times, most recently from 316d6d9 to 2856de4, on August 5, 2022 20:58
@amogh-jahagirdar force-pushed the bulk-delete-with-size branch 3 times, most recently from 0922b40 to eeeda0c, on August 5, 2022 23:03
@amogh-jahagirdar force-pushed the bulk-delete-with-size branch 2 times, most recently from 95ad675 to 63374cb, on August 6, 2022 00:01
@aokolnychyi
Contributor

Let me take another look tomorrow. Sorry for the delay!

```java
  }
} catch (Exception e) {
```
Contributor

[doubt] Any reason we are catching a generic exception here?

Contributor Author

Yeah I think in case of any failure we should surface a BulkDeletionFailure at the end. So catching the generic exception allows us to handle any failure, treat it as a failure to delete the entire batch, and add that to the failed files list. I'm not sure of any other case where we want to surface something else. We're logging the specific exception so that folks can debug.
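A rough illustration of that pattern, not the actual S3FileIO code; the logger, tryDeleteBatch, and deleteBatch names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class BatchFailureSketch {
  private static final Logger LOG = LoggerFactory.getLogger(BatchFailureSketch.class);

  // Returns the paths that could not be deleted so the caller can surface a bulk-deletion failure at the end.
  List<String> tryDeleteBatch(String bucket, List<String> keys) {
    List<String> failedPaths = new ArrayList<>();
    try {
      deleteBatch(bucket, keys); // hypothetical helper issuing the S3 DeleteObjects call
    } catch (Exception e) {
      // Any failure is logged for debugging and the whole batch is treated as failed.
      LOG.warn("Failed to delete batch of {} objects in bucket {}", keys.size(), bucket, e);
      keys.forEach(key -> failedPaths.add("s3://" + bucket + "/" + key));
    }
    return failedPaths;
  }

  private void deleteBatch(String bucket, List<String> keys) {
    // build and send the DeleteObjectsRequest here
  }
}
```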

Contributor

@aokolnychyi left a comment

This change looks good to me.

After a closer look, passing an explicit executor started to make more sense to me. I think we may add that overloaded method back once we consume these changes in other places.

@amogh-jahagirdar
Contributor Author

> This change looks good to me.
>
> After a closer look, passing an explicit executor started to make more sense to me. I think we may add that overloaded method back once we consume these changes in other places.

Right, this is my thought as well. Once we see there is a need for the API we can add it; it's harder to go the other way.

Thanks for the review @aokolnychyi !

@jackye1995
Contributor

I think we have enough approvals. Thanks for the work @amogh-jahagirdar, and thanks everyone for the review!

@jackye1995 merged commit 3c6878f into apache:master on Aug 10, 2022
Contributor Author

@amogh-jahagirdar commented Aug 10, 2022

Thanks everyone for the reviews! @jackye1995 @singhpk234 @aokolnychyi @szehon-ho

rdblue added a commit that referenced this pull request Sep 3, 2022
rdblue added a commit that referenced this pull request Sep 3, 2022
InvisibleProgrammer pushed a commit to InvisibleProgrammer/iceberg that referenced this pull request Mar 10, 2023