
Spark 3.4: Implement rewrite position deletes #7389

Merged: 7 commits merged into apache:master from rewrite_position_deletes on May 4, 2023

Conversation

@szehon-ho (Collaborator) commented Apr 20, 2023

This implements the existing RewritePositionDeleteFiles interface with a Spark action.

This action will compact or split position delete files based on input parameters. Most of the logic is reused from RewriteDataFiles, via the new rewriter classes added in #7175. The additional logic here sorts position deletes locally by 'file_path' and 'pos', as defined in the Iceberg spec.

This action also notably removes 'dangling deletes', i.e., position deletes that no longer reference a live data file. Previously this was not possible in any Iceberg action. It is implemented via a left semi-join against the data_files metadata table before the rewrite.

Remaining items: filter() is not yet supported. Because the position delete rewrite runs against the position_deletes metadata table, a filter expressed on the data table does not apply directly; some work is needed to transform it.
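For reference, a minimal usage sketch of the new action. The entry point, option name, and result accessor shown here are assumptions for illustration and may differ from what this PR finally exposes.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewritePositionDeleteFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();
Table table = loadIcebergTable(spark, "db.events"); // hypothetical helper to load the Iceberg table

// Compact the table's position delete files; dangling deletes are removed as part of the rewrite.
RewritePositionDeleteFiles.Result result =
    SparkActions.get(spark)
        .rewritePositionDeletes(table)
        .option("rewrite-all", "true") // assumed option name: force rewriting every delete file
        .execute();

System.out.println("Rewritten delete files: " + result.rewrittenDeleteFilesCount());
```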

@szehon-ho force-pushed the rewrite_position_deletes branch 2 times, most recently from fec93e9 to 0fc63d1 on April 21, 2023 07:36
* @param filesToAdd files that will be added, cannot be null or empty.
* @return this for method chaining
*/
RewriteFiles rewriteDeleteFiles(Set<DeleteFile> filesToDelete, Set<DeleteFile> filesToAdd);
Contributor:

There is probably a problem in RewriteFiles right now: I think this API would assign new delete files a brand-new data sequence number, whereas we should use the max data sequence number of all rewritten position deletes.

On a side note, I am not sure we can ever rewrite equality deletes across sequence numbers. Let me think.
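To illustrate the sequence-number point above, a hypothetical helper (not code from this PR) that picks the data sequence number the replacement files should carry:

```java
import java.util.Set;
import org.apache.iceberg.DeleteFile;

// Hypothetical helper: rewritten position delete files should inherit the max
// data sequence number of the files they replace, not a brand-new one.
static long replacementSequenceNumber(Set<DeleteFile> rewrittenDeletes) {
  return rewrittenDeletes.stream()
      .mapToLong(DeleteFile::dataSequenceNumber) // assumes sequence numbers are already assigned (non-null)
      .max()
      .orElseThrow(() -> new IllegalArgumentException("no delete files to rewrite"));
}
```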

@aokolnychyi (Contributor) commented Apr 27, 2023:

After thinking more about it, we can't rewrite equality deletes across sequence numbers.
I created #7452 to add validation to RewriteFiles.

@szehon-ho (Collaborator, Author):

The failure is more related to #7422.

@aokolnychyi (Contributor) left a comment:

This seems close. My biggest question is about the usage of StructLike in maps.

@aokolnychyi (Contributor) left a comment:

There are a few comments but nothing blocking. This is a huge change so I'll go ahead and merge it. We can address the last comments in a follow-up PR.

Thanks a lot, @szehon-ho! It has been pending for so long!

import org.slf4j.LoggerFactory;

/** Spark implementation of {@link RewritePositionDeleteFiles}. */
public class RewritePositionDeleteSparkAction
Contributor:

Shouldn't this be called RewritePositionDeleteFilesSparkAction? This is public-facing and we usually name it as the interface name + SparkAction.

}

@VisibleForTesting
RewritePositionDeletesGroup rewriteDeleteFiles(
Contributor:

Just making sure that this is indeed used for testing.

filesByPartition.put(coerced, partitionTasks);
}

StructLikeMap<List<List<PositionDeletesScanTask>>> fileGroupsByPartition =
Contributor:

For the future: can we explore the idea of having two helper methods, one for computing files by partition and another for file groups by partition (a sketch follows below)? Not in this PR.
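A rough sketch of what that split could look like. These are hypothetical helpers: the partition coercion to a common partition type done in the PR is omitted for brevity, and planFileGroups stands in for the rewriter's group-planning step.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.PositionDeletesScanTask;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.StructLikeMap;

// Helper 1: bucket scan tasks by their partition value.
static StructLikeMap<List<PositionDeletesScanTask>> filesByPartition(
    Iterable<PositionDeletesScanTask> tasks, Types.StructType partitionType) {
  StructLikeMap<List<PositionDeletesScanTask>> filesByPartition = StructLikeMap.create(partitionType);
  for (PositionDeletesScanTask task : tasks) {
    filesByPartition.computeIfAbsent(task.partition(), ignored -> new ArrayList<>()).add(task);
  }
  return filesByPartition;
}

// Helper 2: plan file groups within each partition (planFileGroups is a stand-in).
static StructLikeMap<List<List<PositionDeletesScanTask>>> fileGroupsByPartition(
    StructLikeMap<List<PositionDeletesScanTask>> filesByPartition, Types.StructType partitionType) {
  StructLikeMap<List<List<PositionDeletesScanTask>>> fileGroups = StructLikeMap.create(partitionType);
  filesByPartition.forEach((partition, tasks) -> fileGroups.put(partition, planFileGroups(tasks)));
  return fileGroups;
}
```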

}

@VisibleForTesting
RewritePositionDeletesCommitManager commitManager() {
Contributor:

Same here about visibility.

RewritePositionDeletesCommitManager commitManager) {
ExecutorService rewriteService = rewriteService();

// Start Commit Service
Contributor:

minor: Inconsistent usage of capital letters across 3 comments in this method.

// Start Commit Service
// Start rewrite tasks
// Stop Commit service

Map<StructLike, List<List<PositionDeletesScanTask>>> groupsByPartition) {
Stream<RewritePositionDeletesGroup> rewriteFileGroupStream =
groupsByPartition.entrySet().stream()
.flatMap(
Contributor:

For the future: can we try refactoring this using some helper methods, because Spotless formats this in a weird way (see the sketch below)? Not in this PR.
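A hypothetical shape of that refactor, for illustration only; the group construction is not this PR's actual code.

```java
import java.util.List;
import java.util.stream.Stream;
import org.apache.iceberg.PositionDeletesScanTask;
import org.apache.iceberg.StructLike;

// Extracting the flatMap body into a named helper keeps the stream pipeline short
// enough that Spotless can format it one step per line.
private Stream<RewritePositionDeletesGroup> groupsForPartition(
    StructLike partition, List<List<PositionDeletesScanTask>> taskGroups) {
  return taskGroups.stream().map(tasks -> newRewriteGroup(partition, tasks)); // newRewriteGroup: assumed factory
}

// usage:
// groupsByPartition.entrySet().stream()
//     .flatMap(entry -> groupsForPartition(entry.getKey(), entry.getValue()));
```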


RewriteExecutionContext(
Map<StructLike, List<List<PositionDeletesScanTask>>> groupsByPartition) {
this.numGroupsByPartition =
Contributor:

Hm, I think we should use StructLikeMap here too.
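A sketch of that suggestion; partitionType and the surrounding names are assumptions based on the visible code, not the PR's final implementation.

```java
import java.util.List;
import java.util.Map;
import org.apache.iceberg.PositionDeletesScanTask;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.StructLikeMap;

// Keying the per-partition group counts with StructLikeMap (instead of a plain
// Map of StructLike) makes lookups depend on partition field values rather than
// on the concrete StructLike implementation class.
static StructLikeMap<Integer> countGroupsByPartition(
    Map<StructLike, List<List<PositionDeletesScanTask>>> groupsByPartition, Types.StructType partitionType) {
  StructLikeMap<Integer> numGroupsByPartition = StructLikeMap.create(partitionType);
  groupsByPartition.forEach((partition, groups) -> numGroupsByPartition.put(partition, groups.size()));
  return numGroupsByPartition;
}
```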

StructLike partition = group.get(0).partition();

// read the deletes packing them into splits of the required size
Dataset<Row> posDeletes =
Contributor:

minor: We frequently add a DF suffix to variables referring to Dataset<Row>, e.g.:

posDeleteDF
dataFileDF
validPosDeleteDF

@aokolnychyi merged commit 667fd86 into apache:master on May 4, 2023
@ajantha-bhat (Member):

@szehon-ho: Are you already working on the CALL procedure for this? If not, I would like to work on it.

@szehon-ho (Collaborator, Author):

Hi @ajantha-bhat, thanks for asking. I haven't started yet, but I had planned to work on it probably late this week or next week, after some cleanup of this patch. Will you have something on that in the next few days?

@szehon-ho (Collaborator, Author) commented May 9, 2023:

Hi @ajantha-bhat, sorry; @aokolnychyi pinged me and this should actually be part of the next release. So I will work on the procedure tomorrow as the first priority, if that is OK.

manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023
@ajantha-bhat (Member):


Ok.

@chenwyi2:
When I backport this to Spark 3.1, I get the following error:
java.lang.IllegalArgumentException: Cannot parse path or identifier: 9a439584-c00f-45a5-9df7-31adfe182900
at org.apache.iceberg.spark.Spark3Util.catalogAndIdentifier(Spark3Util.java:722)
at org.apache.iceberg.spark.Spark3Util.catalogAndIdentifier(Spark3Util.java:713)
at org.apache.iceberg.spark.source.IcebergSource.catalogAndIdentifier(IcebergSource.java:141)
at org.apache.iceberg.spark.source.IcebergSource.extractIdentifier(IcebergSource.java:167)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:288)
I think the problem is:

Dataset<Row> posDeletes =
    spark
        .read()
        .format("iceberg")
        .option(SparkReadOptions.FILE_SCAN_TASK_SET_ID, groupId)
        .option(SparkReadOptions.SPLIT_SIZE, splitSize(inputSize(group)))
        .option(SparkReadOptions.FILE_OPEN_COST, "0")
        .load(groupId);

It seems like Spark 3.1 cannot read the table based on groupId?
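For comparison, a hedged guess at why the backport breaks (not verified against the 3.1 code): the Spark 3.1-era rewrite path loaded staged scan tasks using the table name as the path, while this PR loads by the group id, which newer Spark modules appear to resolve through an internal table cache. A 3.1 backport would likely need something closer to:

```java
// Approximate Spark 3.1-era shape, shown for comparison; option and method
// names are taken from the snippet above and may differ across Iceberg versions.
Dataset<Row> posDeletes =
    spark
        .read()
        .format("iceberg")
        .option(SparkReadOptions.FILE_SCAN_TASK_SET_ID, groupId)
        .option(SparkReadOptions.SPLIT_SIZE, splitSize(inputSize(group)))
        .option(SparkReadOptions.FILE_OPEN_COST, "0")
        .load(table.name()); // load by table name rather than by groupId
```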
