AWS: Use executor service by default when performing batch deletion of files #5379
Conversation
I'm running AWS integ tests to validate this.
-    if (!awsProperties.isS3DeleteEnabled()) {
-      return;
+    if (awsProperties.isS3DeleteEnabled()) {
+      SetMultimap<String, String> bucketToObjects = computeBucketToObjects(paths);
We are now eagerly computing the bucket-to-objects mapping up front. Previously we would iterate over the paths, track the objects per bucket, and once the objects for a given bucket reached the batch size, trigger the deletion and remove that mapping.
Now it's all computed up front, so memory consumption is higher, but I think it should be OK: 1 million objects with a max key size of 1024 bytes is about 1 GB of paths held in memory.
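For reference, the eager grouping could look roughly like this. This is an illustrative sketch only: the real code uses Iceberg's S3URI parsing and relocated Guava, and `computeBucketToObjects` here is simplified.

```java
import com.google.common.collect.HashMultimap;
import com.google.common.collect.SetMultimap;

// Illustrative sketch: group every s3://bucket/key path by bucket so each
// bucket's objects can later be deleted in batches.
class BucketGrouping {
  static SetMultimap<String, String> computeBucketToObjects(Iterable<String> paths) {
    SetMultimap<String, String> bucketToObjects = HashMultimap.create();
    for (String path : paths) {
      String withoutScheme = path.substring("s3://".length()); // simplified URI parsing
      int firstSlash = withoutScheme.indexOf('/');
      String bucket = withoutScheme.substring(0, firstSlash);
      String key = withoutScheme.substring(firstSlash + 1);
      bucketToObjects.put(bucket, key);
    }
    return bucketToObjects;
  }
}
```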
1 GB does sound like a lot, e.g. on the Spark driver. Is it still possible to do it via streaming, i.e. submit the deletion batch to Tasks once it gets full?
Can we try to use https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/paginators/ListObjectsV2Iterable.html so that the list is loaded dynamically instead of buffered up front?
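For context, the SDK paginator streams listing pages lazily; a minimal sketch (bucket and prefix here are placeholder values):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;
import software.amazon.awssdk.services.s3.paginators.ListObjectsV2Iterable;

S3Client s3 = S3Client.create();
ListObjectsV2Iterable pages =
    s3.listObjectsV2Paginator(
        ListObjectsV2Request.builder().bucket("my-bucket").prefix("data/").build());

// Pages are fetched from S3 on demand as the iterable is consumed,
// so the full listing is never buffered in memory.
for (S3Object object : pages.contents()) {
  System.out.println(object.key());
}
```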
@jackye1995 I don't think we actually need a separate API for listing; we are already given the iterable of paths, and that iterable can be a ListObjectsV2Iterable (as deletePrefix does today).
To load lazily while still deleting concurrently, we can do the following (see the sketch below):
1.) Keep the previous approach: construct the bucket/key from each path and track when the objects for a bucket hit a certain batch size.
2.) Instead of using the Tasks framework, submit the deletion to an executor service and just keep track of the future. We don't want to use Tasks because it waits for completion internally.
We want to just submit the batch deletion and move on, then check the statuses of all those tasks at the very end. The underlying executor service fits that pattern. Let me know what you think.
Will update the PR.
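A rough sketch of that pattern, under stated assumptions: the batch size, thread pool, and `deleteBatch` helper are placeholders, not the actual PR code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: submit each full batch to an executor service and only
// check the outcomes at the very end, instead of waiting per batch via Tasks.
class BatchDeleter {
  private static final int BATCH_SIZE = 250; // placeholder; S3 DeleteObjects caps at 1000 keys
  private final ExecutorService pool = Executors.newFixedThreadPool(8); // placeholder pool

  List<String> deleteAll(String bucket, Iterable<String> keys) {
    List<CompletableFuture<List<String>>> pending = new ArrayList<>();
    List<String> batch = new ArrayList<>();
    for (String key : keys) {
      batch.add(key);
      if (batch.size() == BATCH_SIZE) {
        List<String> toDelete = batch;
        // Submit and move on without blocking on completion here.
        pending.add(CompletableFuture.supplyAsync(() -> deleteBatch(bucket, toDelete), pool));
        batch = new ArrayList<>();
      }
    }
    if (!batch.isEmpty()) {
      List<String> toDelete = batch;
      pending.add(CompletableFuture.supplyAsync(() -> deleteBatch(bucket, toDelete), pool));
    }
    // Only now wait for all submitted batches and collect keys that failed.
    List<String> failed = new ArrayList<>();
    pending.forEach(future -> failed.addAll(future.join()));
    return failed;
  }

  // Stub: the real code would issue the S3 DeleteObjects call and return failed keys.
  private List<String> deleteBatch(String bucket, List<String> keys) {
    return List.of();
  }
}
```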
Sounds good!
I think this is probably a valid change, but as of now it doesn't seem to match the PR description? The PR description is about using an executor service. Can you clarify the PR description?
Updated the description so it reflects the new approach. I also removed the integ test fixes and moved them to #5413. Thanks!
Let me take another look tomorrow. Sorry for the delay!
  }
} catch (Exception e) {
[doubt] Any reason we are catching a generic exception here?
Yeah, I think in case of any failure we should surface a BulkDeletionFailure at the end. Catching the generic exception lets us handle any failure, treat it as a failure to delete the entire batch, and add those files to the failed list. I'm not aware of any case where we'd want to surface something else, and we log the specific exception so folks can debug.
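Roughly the pattern described, as an illustrative fragment: `LOG`, `deleteBatch`, and `failedDeletions` are hypothetical stand-ins, not the exact PR code.

```java
try {
  deleteBatch(bucket, keys); // hypothetical helper issuing the actual S3 DeleteObjects call
} catch (Exception e) {
  // Treat any failure as a failure of the whole batch: log the specific cause
  // for debugging and record the keys so a BulkDeletionFailureException can be
  // surfaced once all batches have completed.
  LOG.warn("Failed to delete batch of {} objects in bucket {}", keys.size(), bucket, e);
  failedDeletions.addAll(keys);
}
```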
This change looks good to me.
After a closer look, passing an explicit executor started to make more sense to me. I think we may add that overloaded method back once we consume these changes in other places.
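If that need materializes, the overload would presumably look something like the following; this is a purely hypothetical signature, not part of this PR.

```java
// Hypothetical overload, alongside the existing deleteFiles(Iterable<String>):
void deleteFiles(Iterable<String> pathsToDelete, ExecutorService executorService)
    throws BulkDeletionFailureException;
```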
Right, this is my thought as well. Once we see there is a need for the API we can add it; it's harder to go the other way. Thanks for the review @aokolnychyi!
I think we got enough approvals. Thanks for the work @amogh-jahagirdar, and thanks everyone for the review!
Thanks everyone for the reviews! @jackye1995 @singhpk234 @aokolnychyi @szehon-ho
(cherry picked from commit f6d9ddc)
Update S3FileIO to lazily load batches and use the existing deletion thread pool to perform concurrent S3#RemoveObjects calls.
This will be used in subsequent PRs to perform bulk deletes in procedures such as removing orphan files, expiring snapshots, and purging data, manifest, and old metadata files during table drop.
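For illustration, a caller such as an orphan-file-removal procedure might take the bulk path when the FileIO supports it. This is a hedged sketch: `deleteInBulkIfSupported` and `pathsToDelete` are made-up names, not Iceberg API.

```java
import org.apache.iceberg.io.BulkDeletionFailureException;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsBulkOperations;

// Sketch: prefer the bulk deletion path when available, otherwise fall back
// to deleting files one at a time.
void deleteInBulkIfSupported(FileIO io, Iterable<String> pathsToDelete) {
  if (io instanceof SupportsBulkOperations) {
    try {
      ((SupportsBulkOperations) io).deleteFiles(pathsToDelete);
    } catch (BulkDeletionFailureException e) {
      // Some files could not be deleted; the exception reports how many failed.
    }
  } else {
    pathsToDelete.forEach(io::deleteFile);
  }
}
```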