Description
Apache Iceberg version
1.5.1 (latest release)
Query engine
Other
Please describe the bug 🐞
I am writing a compatibility layer for Teradata so that it can access Iceberg tables stored in AWS S3. I am seeing what at first glance appears to be a bug in Iceberg, but I'd like to get the opinion of the experts here. To be clear, I am using Apache Iceberg 1.5.1 and Apache Arrow 15.0.0.
The problem is that I am getting a NullPointerException thrown from GenericArrowVectorAccessorFactory.java line 224. The NPE is thrown on that line because vector is null:
throw new UnsupportedOperationException("Unsupported vector: " + vector.getClass());
How do I get to this point? Here's the minimal test case:
Prerequisite:
create table otf920ath (
a INT NOT NULL,
b string(10),
c decimal(12, 3)
)
LOCATION 's3://*******************'
TBLPROPERTIES ('table_type' = 'ICEBERG');
INSERT INTO otf920ath values (1, 'san diego', 1024.025);
ALTER TABLE otf920ath
ADD COLUMNS (a1 int);
Repro:
select * from otf920ath;
The above SQL select statement works in AWS Athena, but fails in my code. My code uses an instance of org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.
The cause, as I see it, is that the one row in the table contains only three columns' worth of data, but the current table schema defines four columns. Because of this difference in schemas, Iceberg creates the following four readers, one for each column:
- VectorizedArrowReader, corresponding to column a
- VectorizedArrowReader, corresponding to column b
- VectorizedArrowReader, corresponding to column c
- VectorizedArrowReader$NullVectorReader, corresponding to column a1
Naturally, the VectorizedArrowReader$NullVectorReader instance contains a null value for the vector. This instance is assigned at VectorizedReaderBuilder.java line 100.
Continuing down the code path, Iceberg calls GenericArrowVectorAccessorFactory.getPlainVectorAccessor. This method checks whether vector is an instance of various *Vector types. Because vector is null, it is not an instance of any type, so the method falls through to its final fallback case and tries to throw an exception:
throw new UnsupportedOperationException("Unsupported vector: " + vector.getClass());
The problem is that vector is null, so calling vector.getClass() throws a NullPointerException before the UnsupportedOperationException can be constructed.
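To illustrate the failure mode in isolation, here is a minimal sketch that does not depend on Iceberg or Arrow. The class NullVectorFallbackDemo and its methods are hypothetical stand-ins for the instanceof chain in getPlainVectorAccessor, with Integer and String used in place of the real *Vector types. Only the final throw line is copied from the Iceberg source.

```java
public class NullVectorFallbackDemo {

    // Hypothetical stand-in for getPlainVectorAccessor: a chain of
    // instanceof checks followed by a fallback throw.
    static String describeAccessor(Object vector) {
        if (vector instanceof Integer) {        // stand-in for e.g. IntVector
            return "int accessor";
        } else if (vector instanceof String) {  // stand-in for e.g. VarCharVector
            return "string accessor";
        }
        // null fails every instanceof check, so it reaches this line too.
        // vector.getClass() is evaluated first and throws the NPE, so the
        // UnsupportedOperationException is never created.
        throw new UnsupportedOperationException("Unsupported vector: " + vector.getClass());
    }

    // Returns either the accessor description or the simple name of the
    // exception that escaped describeAccessor.
    static String outcome(Object vector) {
        try {
            return describeAccessor(vector);
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(42));    // supported type
        System.out.println(outcome(3.14));  // unsupported, non-null type
        System.out.println(outcome(null));  // null vector, as with column a1
    }
}
```

In this sketch an unsupported non-null value surfaces as UnsupportedOperationException, while null surfaces as NullPointerException, which matches the top frame of the stack trace.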
The stack trace is:
java.lang.NullPointerException
at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getPlainVectorAccessor(GenericArrowVectorAccessorFactory.java:224)
at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getVectorAccessor(GenericArrowVectorAccessorFactory.java:110)
at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessors.getVectorAccessor(ArrowVectorAccessors.java:54)
at org.apache.iceberg.arrow.vectorized.ColumnVector.getVectorAccessor(ColumnVector.java:136)
at org.apache.iceberg.arrow.vectorized.ColumnVector.<init>(ColumnVector.java:56)
at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:54)
at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:29)
at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:149)
at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:314)
at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:190)
So my questions:
- Is it possible that this is a bug in Iceberg?
- If so, is the fix simply to handle the null value for vector when building the message for the UnsupportedOperationException?
- If not, is there some other code path or method arguments I should be using?
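For the second question, a null-safe message could look roughly like the sketch below. This is not a patch against Iceberg: NullSafeFallbackDemo and unsupportedMessage are hypothetical names, and the sketch only shows how to build the message without dereferencing a null vector. It changes which exception surfaces, but it would not by itself make the scan return rows for column a1.

```java
public class NullSafeFallbackDemo {

    // Builds the fallback message without calling getClass() on null.
    // Hypothetical helper, not part of Iceberg.
    static String unsupportedMessage(Object vector) {
        String type = (vector == null) ? "null" : vector.getClass().getName();
        return "Unsupported vector: " + type;
    }

    // Always throws UnsupportedOperationException, even for a null vector.
    static void rejectUnsupported(Object vector) {
        throw new UnsupportedOperationException(unsupportedMessage(vector));
    }

    public static void main(String[] args) {
        try {
            rejectUnsupported(null);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```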
P.S. I asked this question in the Slack channel but didn't get any traction: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1714676216273989