Spark: Apply row-level delete files when reading #1444
Conversation
FYI @openinx, @JingsongLi, @prodeezy, @rymurr
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;

public class GenericDeleteFilter extends DeleteFilter<Record> {
@shardulm94, @rdsr, this is a separate public class so that we can use it in IcebergInputFormat. It should be fairly easy to apply the deletes when using generics to read.
case ORC:
default:
  throw new UnsupportedOperationException(String.format(
      "Cannot read %s file: %s", deleteFile.format().name(), deleteFile.path()));
nit: Cannot read %s delete file: %s
Just to note that the problem isn't that the file can't be read, but that the file type is not supported
Updated to "Cannot read deletes: %s is not a supported format: %s"
}

public CloseableIterable<T> filter(CloseableIterable<T> records) {
  return applyEqDeletes(applyPosDeletes(records));
Is there anything important about the order of applying deletes here? Is the assumption that there will be more position deletes than equality deletes?
My thought here is just that equality deletes always use a set check, so they're probably cheaper than position deletes, which might have to fall back to the streaming check. But maybe I'm thinking about it wrong.
We might want to do minor compaction to transform equality delete files into position delete files, so I guess there will be more position deletes than equality deletes as time goes on.
The records in position deletes are ordered, so a sorted merge-based check shouldn't be more expensive than a set-based check, considering the cost of building the hash set. Right?
I think that equality deletes are more expensive to apply because they require a projection and a set lookup (hash and maybe equality check), and there could be multiple equality deletes to apply. So the idea is to do the cheapest operation first and the most expensive operation last to do fewer expensive filter checks.
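A standalone Java sketch of that ordering argument (plain JDK types, not the PR's code; Row, deletedPositions, and deletedKeys are made-up names): the cheap positional check runs on every row, while the costlier equality-style check only runs on the rows that survive it.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterOrderSketch {
  static class Row {
    final long pos;    // position of the row in its data file
    final String key;  // column(s) used by equality deletes
    Row(long pos, String key) { this.pos = pos; this.key = key; }
  }

  static List<Row> filter(List<Row> rows, Set<Long> deletedPositions, Set<String> deletedKeys) {
    Predicate<Row> notPosDeleted = row -> !deletedPositions.contains(row.pos);  // cheap lookup on a long
    Predicate<Row> notEqDeleted = row -> !deletedKeys.contains(row.key);        // in practice: projection + struct hash/equality
    return rows.stream()
        .filter(notPosDeleted)  // applied to every row
        .filter(notEqDeleted)   // applied only to rows that survive the positional check
        .collect(Collectors.toList());
  }
}
```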
do minor compaction to transform the equality delete files to position delete files
@chenjunjiedada do we really need to transform equality deletes to position deletes when doing minor compaction? Let's take the case of a single bucket:
txn-0: insert-file0, pos-delete-file0, equality-delete-file0;
txn-1: insert-file1, pos-delete-file1, equality-delete-file1;
txn-2: insert-file2, pos-delete-file2, equality-delete-file2;
The insert-file0's posDeletes are [pos-delete-file0, pos-delete-file1, pos-delete-file2] and its eqDeletes are [equality-delete-file0, equality-delete-file1, equality-delete-file2].
The insert-file1's posDeletes are [pos-delete-file1, pos-delete-file2] and its eqDeletes are [equality-delete-file1, equality-delete-file2].
The insert-file2's posDeletes are [pos-delete-file2] and its eqDeletes are [equality-delete-file2].
You mean we would transform equality-delete-file2 and equality-delete-file1 into a new pos-delete-file3 when doing minor compaction for txn-2?
@openinx, since the cost of merging equality delete files is higher than merging position deletes, it could be an option for minor compaction. I haven't thought through how we would do that compaction yet. I guess we should consider the sequence number when compacting equality delete files, for example, converting equality delete files with the same sequence number into a new position delete file.
import org.apache.parquet.Preconditions;

public abstract class DeleteFilter<T> {
  private static final long DEFAULT_SET_FILTER_THRESHOLD = 100_000L;
Can this be a table property, so that users can tune it according to the executor memory?
I'd also add: since this is based on the reader's memory constraints, shouldn't this also be a reader (data source option) property passed down to the scan?
Yes, eventually. I just want to keep these commits small and more focused. We can add more plumbing for config in parallel.
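For what it's worth, a minimal sketch of what that plumbing could look like once it is added. The property name "read.delete.set-filter-threshold" is hypothetical, not an existing Iceberg property, and this is not part of this PR:

```java
import java.util.Map;

public class SetFilterThresholdSketch {
  private static final long DEFAULT_SET_FILTER_THRESHOLD = 100_000L;
  // Hypothetical property name; the real key would be defined alongside other table properties.
  private static final String SET_FILTER_THRESHOLD_PROP = "read.delete.set-filter-threshold";

  // Read the threshold from table properties (or a reader option map), falling back to the default.
  static long setFilterThreshold(Map<String, String> properties) {
    String value = properties.get(SET_FILTER_THRESHOLD_PROP);
    return value != null ? Long.parseLong(value) : DEFAULT_SET_FILTER_THRESHOLD;
  }
}
```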
StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
StructLikeSet actual = rowSet(table, "id"); // data is added by the reader to apply the eq deletes
StructLikeSet expected = selectColumns(rowSetWithoutIds(29, 89, 122), "id");
// data is added by the reader to apply the eq deletes, use StructProjection to remove it from comparison
StructLikeSet actual = selectColumns(rowSet(table, "id"), "id");
Any specific reason why we changed this test to remove "data" from comparison?
Yes. That column is not in the requested projection. In Spark, the column is not in the returned row, so I had to add selectColumns to remove it from the expected rows. After doing that, I realized that this test was validating both id and data, when the only column it should be validating for correctness is id. While we get the same result, the test was more specific than it needed to be. If we were to remove data from the rows produced by the scan, this test would have broken.
public abstract class DeleteFilter<T> {
  private static final long DEFAULT_SET_FILTER_THRESHOLD = 100_000L;
  private static final Schema POS_DELETE_SCHEMA = new Schema(
Q: I saw that many other classes also define their own POS_DELETE_SCHEMA. Is it possible to move it to a common class, similar to the MetadataColumn?
The schema is not fixed because it can contain data rows. This is the projection schema used to read, which ignores any row data, and it doesn't need to be shared right now. I think this should be the only place where we need this, outside of tests.
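For readers following along, this is roughly the projection being described. It is a sketch based on this discussion, assuming the MetadataColumns path/pos constants, not necessarily the exact constant defined in DeleteFilter:

```java
import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;

class PosDeleteProjectionSketch {
  // Read only the file path and position columns of position delete files,
  // ignoring any copied row data that a delete file may also carry.
  static final Schema POS_DELETE_SCHEMA =
      new Schema(MetadataColumns.DELETE_FILE_PATH, MetadataColumns.DELETE_FILE_POS);
}
```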
  requiredIds.addAll(eqDelete.equalityFieldIds());
}

Set<Integer> missingIds = Sets.newLinkedHashSet(
It seems there's no need to create another new LinkedHashSet? We won't modify this missingIds set, right?
This is mostly to avoid diffing the sets twice, once in isEmpty and once to iterate.
Okay, that makes sense.
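A small sketch of the point above, assuming the two sets being diffed are the required field ids and the ids already present in the read schema (method and parameter names are illustrative): Guava's Sets.difference returns a lazy view, so materializing it once avoids recomputing the difference for both the isEmpty check and the later iteration.

```java
import java.util.Set;
import com.google.common.collect.Sets;

class MissingIdsSketch {
  // Copy the lazy difference view into a concrete LinkedHashSet once, so checking
  // isEmpty() and iterating over the missing ids don't each re-walk the difference.
  static Set<Integer> missingIds(Set<Integer> requiredIds, Set<Integer> presentIds) {
    return Sets.newLinkedHashSet(Sets.difference(requiredIds, presentIds));
  }
}
```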
records = applyPosDeletes(records, fileProjection, task.file().path(), posDeletes, task.file());
records = applyEqDeletes(records, fileProjection, eqDeletes, task.file());
records = applyResidual(records, fileProjection, task.residual());
CloseableIterable<Record> records = openFile(task, readSchema);
Q: since we've added a few columns to the readSchema that may not be included in the projection schema, will the records from this iterator have more columns than the user expects?
For example, for table test=(a,b,c), the user queries the data with select b from test where b > 10, while we use a as the equality field id set for the equality delete files. This CloseableIterable<Record> will return records with columns (a, b), while users would actually expect records with only column (b)?
Yes, the records will have more columns than requested. That's why we project the added columns at the end of the record, (b, a) in your example. @shardulm94 pointed this out on the original PR for generics.
For Spark, this is okay because Spark ignores the extra columns if the schema associated with a row doesn't have them. You can see the check here: https://github.com/apache/iceberg/pull/1444/files#diff-7600f4d25cfdef7f5da70e12126b55c7R137-R138
For generics, we just return the larger row to avoid needing to make a copy right now. I don't think it is worth the cost of a copy to remove the columns, so we would need to add the ability to truncate the columns of a GenericRecord. I think that adding that feature to GenericRecord should be done in a separate PR, if we decide that it should be done.
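To make the (b, a) example concrete, here is a hedged sketch of how such a wider read schema could be assembled, assuming TypeUtil.join and made-up field ids; this is not the PR's exact code:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;
import org.apache.iceberg.types.Types;

class ProjectedReadSchemaSketch {
  // The requested column comes first and the equality-delete column is appended,
  // so an engine that trims rows by schema (like Spark) only exposes b to users.
  static Schema readSchema() {
    Schema requested = new Schema(Types.NestedField.required(2, "b", Types.LongType.get()));
    Schema equalityColumns = new Schema(Types.NestedField.required(1, "a", Types.StringType.get()));
    return TypeUtil.join(requested, equalityColumns); // columns: (b, a)
  }
}
```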
Iterable<CloseableIterable<Record>> deleteRecords = Iterables.transform(deletes,
    delete -> openDeletes(delete, deleteSchema));
StructLikeSet deleteSet = Deletes.toEqualitySet(
Q: do we need to consider maintaining the deleteSet in an LRU cache if several FileScanTasks are located on the same task node in the future?
Reusing the delete set is a good future optimization. We will need to be careful with that, though. I wouldn't want to keep them around any longer than needed because the set could be fairly large. For Spark, we would not want to keep these sets across tasks and may even want to discard sets as they are no longer needed.
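One hedged sketch of what such reuse might look like, using a bounded Guava cache keyed by delete file path. All names here are hypothetical and this is not part of the PR; it only illustrates the trade-off of reusing sets without pinning them in memory:

```java
import java.util.Set;
import java.util.concurrent.Callable;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class DeleteSetCacheSketch<T> {
  // Bounded, GC-friendly cache so overlapping tasks on the same executor can reuse a
  // delete set without keeping large sets alive longer than needed.
  private final Cache<String, Set<T>> cache = CacheBuilder.newBuilder()
      .maximumSize(8)     // bound memory by the number of cached delete sets
      .softValues()       // let the GC drop sets under memory pressure
      .build();

  Set<T> getOrLoad(String deleteFilePath, Callable<Set<T>> loader) throws Exception {
    return cache.get(deleteFilePath, loader); // load once, reuse across tasks on this node
  }
}
```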
Let me do a pass right now.
This looks great to me.
I think it would be essential to cache overlapping deletes in a task for reasonable performance. Right now, we open and scan every delete file for each data file. For example, if we have data files F1 (D1, D2), F2 (D1), F3 (D1), we will read D1 three times. Ideally, we should detect this and cache that delete file.
@rdblue, do we plan to also add the merge fallback for equality deletes once we have sort order ids assigned to data and delete files? What if the sort order does not match? Do we plan to re-sort records locally or simply assume it must fit into memory?
Yes, I think once we have delete and data file metadata tracking the sort order of a file, we should add the merge optimization for equality deletes.
As for fallback, I'm reluctant to provide tools in Iceberg itself for cases where deletes are larger than available memory and must be sorted. Implementing a sort that will spill to disk is an area where we expect processing engines to be much better than Iceberg, so I would hope that we can defer that implementation to the processing engines. This is also an area where a table should be maintained. If sort orders don't match, then data or deletes can be rewritten to compact, convert to position deletes, or re-sort to make the application of deletes mergeable. In the short term, I suspect this will not be a problem in practice because tables will be maintained. We can revisit the fallback case if we need it.
Thanks for the reviews, everyone! I've merged this. Now we should be unblocked to add support to MR and Flink using the
Schema deleteSchema = TypeUtil.select(requiredSchema, ids);

// a projection to select and reorder fields of the file schema to match the delete rows
StructProjection projectRow = StructProjection.create(requiredSchema, deleteSchema);
Q: I saw that StructProjection says it does not support list or map, so that means we don't support having any list or map in the equality fields? For now I think it's OK.
That's right. We should extend it later, but I don't think that many deletes will be by list or map.
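As a usage illustration, a sketch of how the projection above is typically combined with an equality delete set; it uses the Iceberg types visible in this diff, but the class and method names here are made up:

```java
import java.util.function.Predicate;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.util.StructLikeSet;
import org.apache.iceberg.util.StructProjection;

class EqualityDeleteCheckSketch {
  // Project each row down to the equality columns and probe the delete set;
  // rows that match an equality delete are filtered out of the scan.
  static Predicate<StructLike> isDeleted(Schema requiredSchema, Schema deleteSchema, StructLikeSet deleteSet) {
    StructProjection projectRow = StructProjection.create(requiredSchema, deleteSchema);
    return row -> deleteSet.contains(projectRow.wrap(row));
  }
}
```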
@Override
protected StructLike asStructLike(Record record) {
  return asStructLike.wrap(record);
Reconsidering this again, I see that the record argument is actually read with the requiredSchema, so do we need to wrap this record with the same requiredSchema again? Please see https://github.com/apache/iceberg/pull/1444/files#diff-a8a025276b1d93b0830f2ee6c91118efR76.
If the record already matches the requiredSchema, then could we just return the record in the asStructLike method? Then we also wouldn't have to override the pos method in this class again?
The purpose of the wrapper here is to translate from Iceberg's generic representation to the internal representation for values. For example, generics will pass timestamptz as an OffsetDateTime to users, but internally Iceberg uses microseconds from the epoch as a long.
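A standalone sketch of that translation for timestamptz, mirroring the microseconds-from-epoch representation described above; it uses only the JDK and is not Iceberg's internal utility:

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class TimestamptzConversionSketch {
  private static final OffsetDateTime EPOCH = Instant.EPOCH.atOffset(ZoneOffset.UTC);

  // Generic readers hand users an OffsetDateTime; internally the value is a long
  // holding microseconds from the epoch, which is what this conversion produces.
  static long microsFromTimestamptz(OffsetDateTime dateTime) {
    return ChronoUnit.MICROS.between(EPOCH, dateTime);
  }
}
```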
This applies row-level delete files while reading in Spark, and refactors the generic read code to share the filter code between Spark and generics:
- Moves the filter code from GenericReader to DeleteFilter
- Adds SparkDeleteFilter to adapt InternalRow to StructLike for deletes
- Adds GenericDeleteFilter for generic rows, which can be reused for IcebergInputFormat