
Parquet: Fixes get null values for the nested field partition column #4627

Merged (8 commits into apache:master, Nov 18, 2022)

Conversation

@ConeyLiu (Contributor) commented Apr 25, 2022:

We use ConstantReader for the partition column, and the ConstantReader's field column is NullReader.NULL_COLUMN. When the ConstantReader, or its parent (when the parent has only constant children), is wrapped into an OptionReader, the OptionReader always returns null values because of the following code:

@Override
public T read(T reuse) {
  if (column.currentDefinitionLevel() > definitionLevel) {  // the `ConstantReader.currentDefinitionLevel` is always 0
    return reader.read(reuse);
  }

  for (TripleIterator<?> child : children) {
    child.nextNull();
  }

  return null;
}
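
To make the failure mode concrete, here is a minimal, self-contained sketch (simplified names for illustration, not the actual Iceberg classes): the constant's backing null column always reports definition level 0, so for any nested optional field (definitionLevel >= 1) the value branch is never taken.

// Minimal sketch of the bug (hypothetical class, not Iceberg code).
public class OptionReaderBugSketch {
  // What the constant reader's NullReader.NULL_COLUMN reports.
  static final int COLUMN_DEFINITION_LEVEL = 0;

  static Object read(Object constant, int definitionLevel) {
    if (COLUMN_DEFINITION_LEVEL > definitionLevel) {
      return constant;  // never reached for nested fields, where definitionLevel >= 1
    }
    return null;  // the partition constant is silently replaced by null
  }

  public static void main(String[] args) {
    System.out.println(read("partition-value", 1));  // prints: null
  }
}

The fix in the diffs below makes the constant reader report the field's actual definition level (tracked via maxDefinitionLevelsById) instead of 0, so the wrapper takes the value branch.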

Closes #4626

@@ -126,6 +131,48 @@ public void testBasicProjection() throws IOException {
TestHelpers.assertRows(result, expected);
}

@Test
public void testReadPartitionColumn() throws Exception {
Assume.assumeTrue("Temporary skip ORC", FileFormat.ORC != fileFormat);

@ConeyLiu (Contributor, Author):

Temporarily skipping the ORC test because #4599 is working on fixing it.


@Test
public void testReadPartitionColumn() throws Exception {
Assume.assumeTrue("Temporary skip ORC", !"orc".equals(format));

@ConeyLiu (Contributor, Author):

Same as above: temporarily skipping the ORC test because #4599 is working on fixing it.
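
For context, a rough sketch of the scenario both tests exercise (the schema and field names here are assumptions for illustration, not the actual test fixtures): a table identity-partitioned by a nested struct field, where projecting the partition source column from Parquet previously came back null.

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class NestedPartitionSketch {
  public static void main(String[] args) {
    // Schema with a nested field that serves as the partition source.
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "struct", Types.StructType.of(
            Types.NestedField.optional(3, "inner_name", Types.StringType.get()))));

    // Identity-partition by the nested field; before this fix, reading
    // "struct.inner_name" back from Parquet data files returned null.
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .identity("struct.inner_name")
        .build();

    System.out.println(spec);
  }
}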

@ConeyLiu changed the title from "Parquet: Fixes got null values for partition column which paritioned by nested filed" to "Parquet: Fixes get null values for the nested field partition column" on Apr 25, 2022
@ConeyLiu (Contributor, Author) commented:

Thanks, @kbendick, for the detailed code review. The comments have been addressed. Please take another look when you are free.

@kbendick (Contributor) left a comment:

Hey @ConeyLiu! Thanks for the PR. Sorry for the delay in reviewing it; this seems important.

I tested this locally and you are absolutely correct, this is an issue.

Could you rebase this off of the latest master, @ConeyLiu? There have been changes in the TestFlinkInputFormat class, it seems (and we're on 1.15 now, technically, but that's not too urgent).

cc @rdblue

@@ -112,11 +116,14 @@ public ParquetValueReader<RowData> struct(Types.StructType expected, GroupType s
List<ParquetValueReader<?>> reorderedFields = Lists.newArrayListWithExpectedSize(
expectedFields.size());
List<Type> types = Lists.newArrayListWithExpectedSize(expectedFields.size());
// Inferring MaxDefinitionLevel from parent field
int inferredMaxDefinitionLevel = type.getMaxDefinitionLevel(currentPath());
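
For readers unfamiliar with definition levels, a small parquet-mr sketch (with an assumed two-level schema, not one from this PR) of how getMaxDefinitionLevel grows by one for each optional level along a path; the value inferred above is the level a direct child of the current struct would have.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class DefinitionLevelDemo {
  public static void main(String[] args) {
    // message m { optional group location { optional int32 lat; } }
    MessageType schema = Types.buildMessage()
        .optionalGroup()
            .optional(PrimitiveTypeName.INT32).named("lat")
            .named("location")
        .named("m");

    System.out.println(schema.getMaxDefinitionLevel("location"));         // 1
    System.out.println(schema.getMaxDefinitionLevel("location", "lat"));  // 2
  }
}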

@kbendick (Contributor):

Nit: Now that we're putting the field depth into the map maxDefinitionLevelsById, and we have the same check for idToConstant.containsKey(id), do we need to have this fallback?

@ConeyLiu (Contributor, Author):

@kbendick, we cannot update maxDefinitionLevelsById if fieldType.getId() is null; you can see it at L101.

@ConeyLiu (Contributor, Author) commented:

Thanks, @kbendick and @rdblue, for the review. The comments have been addressed.

@ConeyLiu (Contributor, Author) commented:

Hi @kbendick, is there any further advice on this?

@szehon-ho (Collaborator) left a comment:

Nice observation and fix. It looks good to me. Maybe worth having some others look at it as well: @RussellSpitzer, @aokolnychyi?

for (Types.NestedField field : expectedFields) {
int id = field.fieldId();
if (idToConstant.containsKey(id)) {
// containsKey is used because the constant may be null
reorderedFields.add(ParquetValueReaders.constant(idToConstant.get(id)));
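// Fallback: use the level inferred from the parent path when the field id was
// not recorded in maxDefinitionLevelsById (fieldType.getId() can be null).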
int fieldMaxDefinitionLevel = maxDefinitionLevelsById.getOrDefault(id, inferredMaxDefinitionLevel);

@szehon-ho (Collaborator):

I'm kind of new to this code, so curious: why do we take it from the parent? Is it because the field is indeed null here, and we thus just take the parent's definition level?

Should we call it parentMaxDefinitionLevel?

@ConeyLiu (Contributor, Author):

The max definition level of the current column path is type.getMaxDefinitionLevel(currentPath()) - 1; the children's should be type.getMaxDefinitionLevel(currentPath()). So fieldMaxDefinitionLevel is the inferred max definition level of the children; I have updated the comments. It is used only when we cannot find the value in maxDefinitionLevelsById.

@szehon-ho (Collaborator) commented Nov 17, 2022:

Nit: I still think the name 'inferred' is a bit confusing; it indicates to me that it's the one that will be chosen. Would it be better to call it 'defaultMaxDefinitionLevel' or 'parentMaxDefinitionLevel'?

Also, just curious: what is an example of a case where we need this default value? I tried to walk through one example and found the expected field is always in the actual struct.

@ConeyLiu (Contributor, Author):

Changed to defaultMaxDefinitionLevel

@pvary (Contributor) commented Nov 16, 2022:

@ConeyLiu: Thanks for finding and fixing this.
May I ask you to put the fix in the main branches for Spark (3.3) and Flink (1.16) first? Then with another PR we can backport it to all of the relevant branches.

Do we have the same issue for ORC and Avro too?

@ConeyLiu (Contributor, Author) commented:

Thanks, @szehon-ho and @pvary, for the review.

> May I ask you to put the fix in the main branches for Spark (3.3) and Flink (1.16) first? Then with another PR we can backport it to all of the relevant branches.

Updated to Spark 3.3 and Flink 1.16; the Pig part is kept.

> Do we have the same issue for ORC and Avro too?

Avro is OK, while ORC has a similar problem here: #4604.

@@ -148,6 +149,9 @@ public ParquetValueReader<?> struct(
int id = fieldType.getId().intValue();

@ConeyLiu (Contributor, Author):

@szehon-ho, here fieldType.getId() could be null; I guess this is for compatibility purposes. So I added int defaultMaxDefinitionLevel = type.getMaxDefinitionLevel(currentPath()); in the following code in case we get a null value from maxDefinitionLevelsById.

@szehon-ho (Collaborator):

Thanks! Yeah, that's OK to me; it's just that in my understanding I'm not entirely sure when it is ever null.

Sorry, would you be able to change the comment as well to "Defaulting to parent max definition level" or something like that? Otherwise the patch looks good to me.

@ConeyLiu (Contributor, Author):

Updated

@szehon-ho (Collaborator):

Thanks, looks good to me.

@szehon-ho (Collaborator) left a comment:

It's a small nit, so I'll just approve it.

@szehon-ho merged commit 25335a0 into apache:master on Nov 18, 2022
@szehon-ho (Collaborator) commented:
Thanks @ConeyLiu for the fix, and @pvary and @kbendick for the additional review.

@ConeyLiu (Contributor, Author) commented:
Thanks, @szehon-ho, for merging this, and everyone for the review. I will submit a backport PR for the other Spark/Flink versions.

@ConeyLiu deleted the partition_null branch on Nov 21, 2022
pvary pushed a commit that referenced this pull request on Dec 2, 2022