MR: apply row-level delete files when reading #1497

chenjunjiedada · 2020-09-23T13:39:47Z

This applies row-level delete files when reading through IcebergInputFormat. It also includes the changes from #985.

This also refactors the delete read unit tests into a separate base test class to avoid duplication.

rdblue · 2020-09-23T23:16:58Z

data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java

@@ -148,15 +144,15 @@ public void testMixedPositionAndEqualityDeletes() throws IOException {
    );

    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
+        table, Files.localOutput(temp.newFile()), TestHelpers.Row.of(0), dataDeletes, dataSchema);


Can you import this class directly to avoid so many changes in this file?

rdblue · 2020-09-23T23:17:10Z

data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java

@@ -269,4 +265,5 @@ private StructLikeSet rowSetWithoutIds(int... idsToRemove) {
        .forEach(set::add);
    return set;
  }
+


Nit: unnecessary whitespace change

rdblue · 2020-09-23T23:19:39Z

mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java

@@ -129,7 +133,7 @@
          // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
          checkResiduals(task);
        }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));


While I would like to get the encryption manager and io changes in, I don't think that they should be mixed into this commit. Was it necessary to do this for some reason?

The GenericDeleteFilter constructor needs FileIO as parameter.

In that case, this can pass the FileIO somehow, or we can work on getting the other PR done before this one. But I don't think we should mix the two features together.

For now, this could create a new HadoopFileIO and use that instead. That would be the easiest path forward.

@holdenk just added a FileIO instance to this class, so you can use that instead of mixing the two PRs together. Thanks @holdenk!

Thank you @holdenk @rdblue. I updated this.

rdblue · 2020-09-23T23:22:47Z

data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java

+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );


Why not put the schema and spec in the parent class, DeletesReadTest? The data it generates is for this schema.

rdblue · 2020-09-23T23:25:20Z

data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java

@@ -92,6 +72,22 @@ public void testEqualityDeletes() throws IOException {
    Assert.assertEquals("Table should contain expected rows", expected, actual);
  }

+  protected void generateTestData() throws IOException {
+    this.records = new ArrayList<>();


We prefer using Lists.newArrayList()

rdblue · 2020-09-23T23:25:40Z

data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java

+  public void writeTestDataFile() throws IOException {
+    File tableDir = temp.newFolder();
+    tableDir.delete();
+    this.table = TestTables.create(tableDir, "test", SCHEMA, SPEC, 2);


I think a better way to break down the class would be to have an abstract Table createTable(String name, Schema, Spec) method. Then the table and dataFile fields don't need to be shared. I also don't think that there is a need to make records public either.

Makes sense to me. Updated.
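
For readers following along, the suggested breakdown might look roughly like the sketch below. All names here (`SketchTable`, `DeletesReadSketch`, `GenericSuiteSketch`) are hypothetical stand-ins, and schema and spec are plain strings rather than Iceberg's `Schema` and `PartitionSpec`; it only illustrates the abstract `createTable` hook with `records` kept private.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Iceberg's Table; it only records who created it.
class SketchTable {
  final String name;
  final String createdBy;

  SketchTable(String name, String createdBy) {
    this.name = name;
    this.createdBy = createdBy;
  }
}

// Plays the role of the shared base test class.
abstract class DeletesReadSketch {
  // Private: subclasses never need direct access to the generated rows.
  private final List<String> records = new ArrayList<>();

  // Each engine-specific suite supplies its own way of creating the table.
  // Schema and spec are strings here; the real signature would take Iceberg types.
  protected abstract SketchTable createTable(String name, String schema, String spec);

  // Shared setup: generate rows, then delegate table creation to the subclass.
  SketchTable setUp() {
    records.add("row-1");
    records.add("row-2");
    return createTable("test", "id:int,data:string", "unpartitioned");
  }

  int recordCount() {
    return records.size();
  }
}

// One concrete suite; another engine would add its own subclass.
class GenericSuiteSketch extends DeletesReadSketch {
  @Override
  protected SketchTable createTable(String name, String schema, String spec) {
    return new SketchTable(name, "generic");
  }
}
```

With this shape the base class owns data generation, and no `table` or `dataFile` field has to be shared between the base class and its subclasses.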

rdblue · 2020-09-23T23:27:12Z

mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java

@@ -248,6 +258,26 @@ public void close() throws IOException {
      return iterable;
    }

+    @SuppressWarnings("unchecked")


Why is this needed?

deletes.filter(...) needs this.

Okay, makes sense.

rdblue · 2020-09-23T23:27:56Z

mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java

+        case GENERIC:
+          DeleteFilter deletes = new GenericDeleteFilter(io, currentTask, tableSchema, readSchema);
+          Schema requiredSchema = deletes.requiredSchema();
+          iter = deletes.filter(openTask(currentTask, requiredSchema));


Why not return deletes.filter(...) here? That would remove the need for iter and break.
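
A minimal sketch of the shape being suggested, using hypothetical stand-ins (`ModelSketch`, `OpenSketch`, integer ids in place of records). It is not the actual IcebergInputFormat code, and leaving PIG and HIVE unfiltered is an assumption of the sketch:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for the in-memory data model setting.
enum ModelSketch { GENERIC, PIG, HIVE }

class OpenSketch {
  // Returning from each case removes the need for a local variable and break statements.
  static List<Integer> open(ModelSketch model, List<Integer> ids, Set<Integer> deleted) {
    switch (model) {
      case PIG:
      case HIVE:
        // Assumption of this sketch: these models are returned unfiltered.
        return ids;
      case GENERIC:
        // Stand-in for deletes.filter(...): drop every id marked as deleted.
        return ids.stream()
            .filter(id -> !deleted.contains(id))
            .collect(Collectors.toList());
      default:
        throw new UnsupportedOperationException("Unsupported data model: " + model);
    }
  }
}
```

Because every branch either returns or throws, the compiler also verifies that no path falls through without a value.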

rdblue · 2020-09-23T23:29:25Z

mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java

+      }
+    }
+
+    return parameters;


Is there a simpler way to configure this? Normally, we build these using literals instead of a block of code.

Yes, I just updated it.
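
For context, the literal style mentioned above usually looks something like the sketch below. The parameter values are placeholders, not the ones used in TestMrReadDeletes, and the class name is hypothetical; in a JUnit 4 suite the method would additionally carry the `@Parameterized.Parameters` annotation, which is left out here so the snippet compiles without JUnit.

```java
class ParametersSketch {
  // Each inner array is one parameter combination for the suite's constructor.
  // Placeholder values; the real suite would list its own combinations.
  static Object[][] parameters() {
    return new Object[][] {
        new Object[] { "format-a", "model-x" },
        new Object[] { "format-a", "model-y" },
        new Object[] { "format-b", "model-x" },
        new Object[] { "format-b", "model-y" }
    };
  }
}
```

Listing the combinations literally makes it obvious at a glance which cases run, at the cost of a little repetition.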

rdblue · 2020-09-23T23:35:08Z

spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java

-    return rowSet(table, "*");
-  }
-
-  private static StructLikeSet rowSet(Table table, String... columns) {


This method is what reads the rows from the table using Spark. Deleting this method and using the one in DeletesReadTest makes this test suite use the exact same read path as the generics -- IcebergGenerics.

You can probably make this method abstract and implement it in both classes to get around this. You'll also need to implement a read using the input format or Hive runner to test the Hive code.

Sorry, I missed this. I just added it back and also used the input format to read records.
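
A rough sketch of why the read method is better left abstract: the shared test logic calls `rowSet`, so each suite exercises its own read path instead of all of them going through the same one. The class names and the tagged row strings are hypothetical, standing in for reads through IcebergGenerics and through the input format.

```java
import java.util.Arrays;
import java.util.List;

// Plays the role of the shared base test class.
abstract class ReadPathSketch {
  // Each suite reads rows through the engine it is meant to test.
  protected abstract List<String> rowSet(String... columns);

  // Shared test logic only reads through rowSet, so it uses the subclass's read path.
  String firstRow() {
    return rowSet("*").get(0);
  }
}

// Stand-in for a suite that reads with the generic reader.
class GenericReadSketch extends ReadPathSketch {
  @Override
  protected List<String> rowSet(String... columns) {
    return Arrays.asList("generic-read:1", "generic-read:2");
  }
}

// Stand-in for a suite that reads with the input format.
class InputFormatReadSketch extends ReadPathSketch {
  @Override
  protected List<String> rowSet(String... columns) {
    return Arrays.asList("input-format-read:1", "input-format-read:2");
  }
}
```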

rdsr

LGTM! The changes in MR to incorporate deletes are minimal and clean!