Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: add support for null vectors #10953

Open
wants to merge 48 commits into
base: main
Choose a base branch
from

Conversation

slessard
Copy link

@slessard slessard commented Aug 16, 2024

Add new class NullAccessor to aid reading null columns. A null column is a column that exists in the Iceberg schema, but does not exist in the parquet schema. A column would exist in Iceberg, but not in Parquet when a column is added to the Iceberg schema, but no new rows have been added to the table. See ArrowReadetTest.testReadColumnThatDoesNotExistInParquetSchema for an example.

Closes #10275

sl255051 and others added 23 commits May 7, 2024 18:21
Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException
This test more closely follows the reproduction steps described in issue apache#10275
…eaderTest.java

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
…eaderTest.java

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns
@github-actions github-actions bot added the arrow label Aug 16, 2024
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sl255051 I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor that internally tracks a NullVector.

I would suggest to squash all the commits to a single one and then go through my comments

@sl255051
Copy link
Contributor

sl255051 commented Aug 24, 2024 via email

Those two test helper methods are highly tuned for a specific schema, a schema that does not exist in this test.
@slessard slessard marked this pull request as ready for review September 20, 2024 22:51
@slessard
Copy link
Author

@amogh-jahagirdar This PR is ready for your review

These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.
@slessard
Copy link
Author

@amogh-jahagirdar I am not planning to make any additional changes. Please review

@slessard
Copy link
Author

slessard commented Oct 7, 2024

@RussellSpitzer I'd appreciate it if you could review these changes in iceberg-arrow

Comment on lines +303 to +305
// Alter the table schema by adding a new, optional column.
// Do not add any data for this new column in the one existing row in the table
// and do not insert any new rows into the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be confusing vectorized read paths but I'm curious why this isn't reproducible in Spark vectorized reads? Or is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try and repro in a unit test for Spark and see if it's the case. To be clear I don't want to hold up this PR on that though since it does seem like a legitimate problem based on the test being done here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never tried reproducing the issue in Spark. My guess as to why there was no existing test case is that null vectors isn't a bug so much as it is a subfeature that was never implemented. No sense in creating a test for a subfeature that was knowingly never implemented.

What makes me say null vector support was knowingly never implemented? Look at the removed comment in arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java. That comment wasn't a "this is how this code works" type of comment. In my opinion that comment is a TODO comment.

this.numRows = numRows;
this.constantValue = null;
}

public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
super(icebergField);
super(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of placing an if-statement in the constructor itself, it feels more like we should just have a different constructor or class?

Copy link
Author

@slessard slessard Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer In an earlier draft of this PR I did create a NullVectorHolder class but was told to instead adapt ConstantVectorHolder to work for my case, so I did that.

Here's a patch to add a new constructor, but it has at least one issue. Passing null as the constant value in the existing constructor is no longer allowed. This is a semantic breaking change, though not an API breaking change. What this means is that third party code that upgrades to a version of Apache Iceberg containing this patch that was previously using this constructor and passing in a null constant value will still compile fine but will have a runtime failure. That doesn't seem like an appealing solution to me.

I think adding a new class such as NullVectorHolder has its own issue. What would be the use case for a ConstantVectorHolder containing a null value versus a NullVectorHolder containing a null value?

Which solution would you prefer?

  1. ternary operator in the super call
  2. Overloaded constructor with semantics breaking change
  3. New class, NullvectorHolder, with ambiguous use case
Subject: [PATCH] Changes
---
Index: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
--- a/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java	(revision 163ee62ce52bc9611198325997010c7e2b793c71)
+++ b/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java	(date 1728512295607)
@@ -148,16 +148,21 @@
     }
 
     public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
+      super(icebergField);
+      Preconditions.checkNotNull(constantValue, "Constant value cannot be null");
+      this.numRows = numRows;
+      this.constantValue = constantValue;
+    }
+
+    public ConstantVectorHolder(Types.NestedField icebergField, int numRows) {
       super(
-          (null == constantValue) ? new NullVector(icebergField.name(), numRows) : null,
-          icebergField,
-          new NullabilityHolder(numRows));
-      if (null == constantValue) {
-        nullabilityHolder().setNulls(0, numRows);
-      }
+              new NullVector(icebergField.name(), numRows),
+              icebergField,
+              new NullabilityHolder(numRows));
+      nullabilityHolder().setNulls(0, numRows);
 
       this.numRows = numRows;
-      this.constantValue = constantValue;
+      this.constantValue = null;
     }
 
     @Override

@@ -140,12 +141,21 @@ public static class ConstantVectorHolder<T> extends VectorHolder {
private final int numRows;

public ConstantVectorHolder(int numRows) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we keeping this constructor? Is it needed if we can make a NullVectorHolder?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor could be deleted but that would be an API breaking change because this is a public constructor in a public class. Which would you prefer, an extra constructor or a breaking change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NullPointerException when using VectorizedArrowReader to read a null column
5 participants