
Bulk delete #6682

Merged: 12 commits into apache:master from BulkDelete, Mar 2, 2023
Conversation

RussellSpitzer (Member)

Changes all deleting Spark Actions to use FileIO bulk operations, and adds bulk delete to HadoopFileIO.

The basic idea here is that all of our deletes should use the bulk API, or at least have their parallelism controlled primarily at the FileIO level. All deletes should use some parallelism by default.
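
As a rough sketch of what this means on the caller side (pathsToDelete is a placeholder iterable here; SupportsBulkOperations.deleteFiles is the org.apache.iceberg.io bulk API this PR wires the actions to):

FileIO io = table.io();
if (io instanceof SupportsBulkOperations) {
  // One call; the FileIO decides how to batch and parallelize the deletes.
  ((SupportsBulkOperations) io).deleteFiles(pathsToDelete);
} else {
  // Fallback: single deletes through the plain FileIO API.
  for (String path : pathsToDelete) {
    io.deleteFile(path);
  }
}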

Previously, deletes were handled by a per-Action executor service used to parallelize single deletes. In this PR we move the responsibility for performing the deletes, and for parallelizing them, to the FileIO via SupportsBulkOperations.

This deprecates the methods previously used for single deletes, as well as those for passing executor services to Actions that delete many files.
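
On the FileIO side, an implementation then looks roughly like this (a sketch only, not the PR's exact HadoopFileIO code; executorService() and the retry count are illustrative, and Tasks is org.apache.iceberg.util.Tasks):

@Override
public void deleteFiles(Iterable<String> pathsToDelete) throws BulkDeletionFailureException {
  AtomicInteger failureCount = new AtomicInteger(0);
  Tasks.foreach(pathsToDelete)
      .executeWith(executorService())  // parallelism now lives in the FileIO
      .retry(3)
      .suppressFailureWhenFinished()
      .onFailure(
          (path, exc) -> {
            LOG.error("Failure during bulk delete on file: {}", path, exc);
            failureCount.incrementAndGet();  // fires once per path, after retries are exhausted
          })
      .run(this::deleteFile);

  if (failureCount.get() > 0) {
    throw new BulkDeletionFailureException(failureCount.get());
  }
}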
@amogh-jahagirdar (Contributor)

Thanks a ton for closing the loop on this, @RussellSpitzer! Left some comments.

@aokolnychyi (Contributor)

I am getting to this today, hopefully.

@dramaticlly (Contributor) left a comment

Would love to see this merged, thank you @RussellSpitzer

@aokolnychyi (Contributor)

I agree with the overall direction, but I'd try to support the existing API to avoid massive deprecation and to simplify the implementation. It will be hard to test all possible scenarios.

.onFailure(
    (f, e) -> {
      LOG.error("Failure during bulk delete on file: {} ", f, e);
      failureCount.incrementAndGet();
Contributor

This is going to increment the count on each failed attempt and won't be accurate. We could count the number of successfully deleted files instead and then use Iterables.size(pathsToDelete) to find how many we were supposed to delete.
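
For illustration, the suggestion looks something like this (a sketch; pathsToDelete and io are placeholders, and Iterables is the Guava helper), though as the replies below note, it walks the iterable twice:

AtomicInteger successCount = new AtomicInteger(0);
Tasks.foreach(pathsToDelete)
    .retry(3)
    .suppressFailureWhenFinished()
    .run(
        path -> {
          io.deleteFile(path);
          successCount.incrementAndGet();  // only reached when the delete succeeded
        });
int failureCount = Iterables.size(pathsToDelete) - successCount.get();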

RussellSpitzer (Member, Author)

Ah, I thought it fired once per element.

RussellSpitzer (Member, Author)

I'm not sure we want to go over the iterable more than once ... let me think about this.

RussellSpitzer (Member, Author)

I double-checked this; it only fires off when all retries are exhausted, so it is correct as is.

scala> def testFailure() = {
  var failureCount = 0
  Tasks.foreach("value")
    .retry(3)
    .onFailure((y, x: Throwable) => failureCount += 1)
    .suppressFailureWhenFinished()
    .run(x => throw new Exception("ohNO"))
  failureCount
}

scala> testFailure()
23/03/01 10:16:22 WARN Tasks: Retrying task after failure: ohNO
java.lang.Exception: ohNO
...	
23/03/01 10:16:23 WARN Tasks: Retrying task after failure: ohNO
java.lang.Exception: ohNO
...
23/03/01 10:16:25 WARN Tasks: Retrying task after failure: ohNO
java.lang.Exception: ohNO
...
res21: Int = 1

RussellSpitzer (Member, Author)

  runTaskWithRetry(task, item);
  succeeded.add(item);
} catch (Exception e) {
  exceptions.add(e);
  if (onFailure != null) {
    tryRunOnFailure(item, e);
The code in question (runTaskWithRetry) does all retries before hitting onFailure.

Contributor

You are right, we overlooked it while reviewing another PR. I like this more. I'll update SparkCleanupUtil to follow this pattern as well.

    .suppressFailureWhenFinished()
    .onFailure((file, exc) -> LOG.warn("Failed to delete file: {}", file, exc))
    .run(deleteFunc::accept);

if (deleteFunc == null && table.io() instanceof SupportsBulkOperations) {
RussellSpitzer (Member, Author)

This is the new pattern:

if (bulk) {
  bulk delete
} else {
  if no custom delete function: table.io() delete
  if custom delete function: custom delete
}

This logic is repeated in all of the actions
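
In Java terms, the dispatch looks roughly like this (a sketch; the non-bulk deleteFiles helper and executorService names are illustrative, not the PR's exact signatures):

if (deleteFunc == null && io instanceof SupportsBulkOperations) {
  // Bulk path: hand the whole iterable to the FileIO in one call.
  summary = deleteFiles((SupportsBulkOperations) io, files);
} else if (deleteFunc == null) {
  // No custom delete function: single deletes through the table's FileIO.
  summary = deleteFiles(executorService, io::deleteFile, files);
} else {
  // Caller-supplied delete function takes precedence over the FileIO.
  summary = deleteFiles(executorService, deleteFunc::accept, files);
}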

if (deleteFunc == null && io instanceof SupportsBulkOperations) {
  summary = deleteFiles((SupportsBulkOperations) io, files);
} else {

Contributor

I actually meant an empty line after the DeleteSummary variable, but formatting here is up to you.
I like the new pattern.


if (deleteFunc == null) {
  LOG.info(
      "Table IO {} does not support bulk operations. Using non-bulk deletes.", table.io());
Contributor

Same here.

@aokolnychyi (Contributor) left a comment

A few non-blocking nits. Looks great otherwise. Thanks, @RussellSpitzer! Feel free to merge whenever you are ready.

@RussellSpitzer (Member, Author)

Thanks @amogh-jahagirdar, @dramaticlly, and @aokolnychyi! I'll merge when tests pass. I'll do the backport PRs after my Subsurface talk.

@RussellSpitzer RussellSpitzer merged commit 5e40182 into apache:master Mar 2, 2023
@RussellSpitzer RussellSpitzer deleted the BulkDelete branch March 2, 2023 17:27
RussellSpitzer added a commit to RussellSpitzer/iceberg that referenced this pull request Mar 8, 2023
aokolnychyi pushed a commit that referenced this pull request Mar 10, 2023
This change backports PR #6682 to Spark 3.2.
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023