Description
Importing a C Data Interface array into Java throws an exception when the vector being constructed has an empty data buffer, which the producer passes as a null pointer.
Example: importing an empty list of Int32 primitives throws the following:
Exception in thread "main" java.lang.IllegalArgumentException: Could not load buffers for field $data$: Int(32, true). error message: Buffer 1 for type Int(32, true) cannot be null
at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:131)
at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
at org.apache.arrow.c.ArrayImporter.importArray(ArrayImporter.java:71)
at org.apache.arrow.c.Data.importIntoVector(Data.java:289)
at org.apache.arrow.c.Data.importIntoVectorSchemaRoot(Data.java:332)
at org.apache.arrow.dataset.jni.NativeScanner$NativeReader.loadNextBatch(NativeScanner.java:151)
How to reproduce
Create a sample file (Jupyter notebook):
python: 3.10.4
pyarrow: 12.0.1
pandas: 2.0.3
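import pyarrow as pa
import pyarrow.feather as pf
import pandas as pd

# One nullable column "a" of type list<int32>
schema = pa.schema([
    pa.field("a", pa.list_(pa.int32()), True),
])

# A single row whose value is an empty list, so the child Int32 array has length 0
df = pd.DataFrame(columns=["a"], index=range(1))
df.iloc[0] = [[]]

# Pass the schema by keyword; the second positional parameter of pa.table is `names`
table = pa.table(df, schema=schema)
pf.write_feather(table, "/tmp/sample.feather", compression="uncompressed")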
Read the file via the Java Dataset API (Kotlin):
JVM: openjdk/20.0.1
arrow: 12.0.1
kotlin: 1.9.0
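import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

val path = "file:/tmp/sample.feather"
val allocator = RootAllocator()
val nativeMemoryPool = NativeMemoryPool.getDefault()
val factory = FileSystemDatasetFactory(allocator, nativeMemoryPool, FileFormat.ARROW_IPC, path)
factory.finish().use { dataset ->
    dataset.newScan(ScanOptions(10L)).use { scanner ->
        scanner.scanBatches().use { reader ->
            println("$path: schema=${reader.vectorSchemaRoot.schema.toJson()}")
            // loadNextBatch() is where the IllegalArgumentException is thrown
            while (reader.loadNextBatch()) {
                println(reader.vectorSchemaRoot.contentToTSVString())
            }
        }
    }
}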
Comment
A quick look at the Java code shows that the ArrayImporter class uses an instance of BufferImportTypeVisitor, which imports a vector's buffers based on the field's data type.
In this case the visit(ArrowType.Int type) method is called. It accepts a null validity bitmap buffer but requires a non-null data buffer.
My understanding is that, from the vector's perspective, the data buffer must not be null, so the visitor enforces this. However, according to the C data ArrowArray spec, an array may carry null buffers:
The buffer pointers MAY be null only in two situations:
- for the null bitmap buffer, if ArrowArray.null_count is 0;
- for any buffer, if the size in bytes of the corresponding buffer would be 0.
Based on the above, it seems to me that BufferImportTypeVisitor could create an empty data buffer when the corresponding C data buffer is null and the field is empty (fieldNode.length == 0), and throw only when the field is not empty.
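To make the proposal concrete, here is a minimal, self-contained sketch of the decision it describes. It does not use Arrow's classes: `NullBufferPolicy` and `importDataBuffer` are hypothetical names, and `java.nio.ByteBuffer` stands in for an Arrow buffer, with `null` modelling a NULL pointer in the C `ArrowArray`. The real change would presumably live inside BufferImportTypeVisitor and may look quite different.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for the import decision; not part of Arrow's API. */
final class NullBufferPolicy {
    private NullBufferPolicy() {}

    /**
     * Resolves the data buffer of a fixed-width array being imported.
     *
     * @param imported      the producer's buffer, or null if the C pointer was NULL
     * @param valueCount    number of values in the array (the field node length)
     * @param bytesPerValue width of one value, e.g. 4 for Int32
     */
    static ByteBuffer importDataBuffer(ByteBuffer imported, long valueCount, int bytesPerValue) {
        if (imported != null) {
            // Normal case: the producer supplied a buffer, use it as is.
            return imported;
        }
        long expectedBytes = valueCount * bytesPerValue;
        if (expectedBytes == 0) {
            // The spec allows a null pointer for a zero-sized buffer,
            // so substitute an empty buffer instead of failing.
            return ByteBuffer.allocate(0);
        }
        // A null pointer for a non-empty buffer violates the spec.
        throw new IllegalStateException(
            "Data buffer is null but " + expectedBytes + " bytes were expected");
    }
}
```

With this policy, the empty Int32 child array from the reproduction (length 0, null data pointer) would import as an empty buffer, while a null data pointer on a non-empty array would still be rejected.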
Component(s)
Java