Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++/Java] Error when reading inner lists within a struct in empty outer lists from C++/Python in Java #31396

Open
asfimport opened this issue Mar 18, 2022 · 5 comments

Comments

@asfimport
Copy link
Collaborator

When using C++ (or Python) to construct a null or empty outer array of type array_1: list<item: struct<array_sub_col: list<item: string>>>, either:


-  array_1: null
-  array_1: []

an out of bounds exceptions (see stack trace below) follows when later retrieving the field reader for the inner list (array_sub_col) in Java, when trying to access the subsequent offset buffer: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java#L64

Reproduction

Java: 7.0.0
C++: 7.0.0
Python: 7.0.0

Creating a stream on C++ of type array_1: list<item: struct<array_sub_col: list<item: string>>> with an empty (or null) outer list:

arrow::MemoryPool* pool = arrow::default_memory_pool();
arrow::Result<std::shared_ptr<arrow::io::BufferOutputStream>> stream_buffer =
    arrow::io::BufferOutputStream::Create(1, pool);

std::vector<std::shared_ptr<arrow::Field>> inner_list_field{std::make_shared<arrow::Field>("array_sub_col", arrow::list(arrow::utf8()))};

// Datatype for the builder: list<struct<list<string>>>
std::shared_ptr<DataType> data_type = list(struct_(inner_list_field));

std::unique_ptr<arrow::ArrayBuilder> builder;
arrow::MakeBuilder(pool, data_type, &builder);
auto* list_builder = dynamic_cast<arrow::ListBuilder*>(builder.get());

// Append a null or an empty list to the outer list
list_builder->AppendNull(); // or list_builder->AppendEmptyValue()

std::vector<std::shared_ptr<arrow::Array>> value_batch;
value_batch.resize(1);
list_builder->Finish(&value_batch[0]);

std::vector<std::shared_ptr<arrow::Field>> outer_list_field{std::make_shared<arrow::Field>("array_1",data_type)};
auto schema = std::make_shared<arrow::Schema>(outer_list_field);

// Build a single row record batch
std::shared_ptr<arrow::RecordBatch> batch = RecordBatch::Make(schema, 1, value_batch);
ASSERT_OK(batch->Validate());

// Stream the batch to a file to later read on the Java side
arrow::Result<std::shared_ptr<ipc::RecordBatchWriter>> stream_writer = 
    arrow::ipc::MakeStreamWriter(stream_buffer.ValueOrDie().get(), schema, arrow::ipc::IpcWriteOptions::Defaults());
stream_writer.ValueOrDie()->WriteRecordBatch(*batch);

arrow::Result<std::shared_ptr<arrow::Buffer>> buffer_result = stream_buffer.ValueOrDie()->Finish();
std::shared_ptr<arrow::Buffer> buffer = buffer_result.ValueOrDie();
auto file_output = arrow::io::FileOutputStream::Open("/tmp/batch_stream.out").ValueOrDie();
file_output->Write(buffer->data(), buffer->size());
file_output->Close();

As expected, Python holds the same memory layout for the field vectors as the C++ code above:

array = pa.array([None], type=pa.list_(pa.struct([pa.field("array_sub_col", pa.list_(pa.utf8()))])))
batch = pa.record_batch([struct_array], names=["array_1"])

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

with open('/tmp/batch_stream.out', 'wb') as f:
    f.write(buf)

Java fails when then trying to access the inner list's field reader:

File file = new File("/tmp/batch_stream.out");
byte[] bytes = FileUtils.readFileToByteArray(file);
try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)) {
     Schema schema = reader.getVectorSchemaRoot().getSchema();
     reader.loadNextBatch();
     readBatch.getVector("array_1").getReader().reader().reader("array_sub_col");                     // <- fails: reader("array_sub_col") fails with OOB

     // Concrete readers:
     // FieldVector array_1 = readBatch.getVector("array_1");
     // UnionListReader array_1_reader = (UnionListReader) array_1.getReader();
     // NullableStructReaderImpl struct_reader = (NullableStructReaderImpl) array_1_reader.reader();
     // FieldReader union_list_reader = struct_reader.reader("array_sub_col");                        // <- fails: OOB

Stack trace:


java.lang.reflect.InvocationTargetException
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:297)
    at java.lang.Thread.run (Thread.java:829)
Caused by: java.lang.IndexOutOfBoundsException: index: 4, length: 4 (expected: range(0, 4))
    at org.apache.arrow.memory.ArrowBuf.checkIndexD (ArrowBuf.java:318)
    at org.apache.arrow.memory.ArrowBuf.chk (ArrowBuf.java:305)
    at org.apache.arrow.memory.ArrowBuf.getInt (ArrowBuf.java:424)
    at com.test.arrow.ValidateArrow.testArrow (ValidateArrow.java:433)
    at com.test.arrow.ValidateArrow.main (ValidateArrow.java:440)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:297)
    at java.lang.Thread.run (Thread.java:829)

Reporter: Arrow User

Note: This issue was originally created as ARROW-15971. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Arrow User:
Performed some early investigation on the potential issue, if it helps:
In C++, the inner list consists of a single offset buffer [0], as no append is ever called on the inner list. Note that, in this case, the outer list is empty, and therefore struct does not append empty values to its children.

The memory layout for the list items is shown as:


array_1: list<item: struct<array_sub_col: list<item: string>>>         < -         offset: [0, 0] ; offset capacity: 8;  value size/capacity: 1;
  child 0, item: struct<array_sub_col: list<item: string>>             
      child 0, array_sub_col: list<item: string>                       < -         offset: [0]    ; offset capacity: 4;  value size/capacity: 0;
          child 0, item: string

C++ explicitly fails on zero length offsets: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_nested.cc#L111, which means that populating a single int32 offset buffer on lists that are never appended to before finishing the builder to an array is expected. C++ also relies on the offset length (offset length - 1) to denote the length of the array to construct during reading on various parts (e.g. ListArrayFromArrays).
A fix for Java on 7.0.0 was introduced: 13b66b5 to handle 0 offset capacity, but the issue remains with the latest versions.
Therefore, would it be expected that the Java code assigns a current and max offset positioning of 0, when encountered with offset builders of capacity <=4 (or ==0 || ==4)? See https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java#L59:

public void setPosition(int index) {
   super.setPosition(index);
-   if (vector.getOffsetBuffer().capacity() == 0) {
+   if (vector.getOffsetBuffer().capacity() <= OFFSET_WIDTH) {
     currentOffset = 0;
     maxOffset = 0;
   } else { ...

@asfimport
Copy link
Collaborator Author

Alessandro Molina / @amol-:
@davisusanibar @lidavidm what do you think? 

@asfimport
Copy link
Collaborator Author

asfimport commented Mar 21, 2022

David Li / @lidavidm:
I think Java is incorrect, yes. The  specification states

The offsets buffer contains length + 1 signed integers (either 32-bit or 64-bit, depending on the logical type), which encode the start position of each slot in the data buffer.

So for a List type, a length 0 list should have 1 32-bit offset value, but Java appears to assume either the offsets buffer will be empty or that it will have at least two values.

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
This is basically the same, but I guess now it's a question of semantics: what should setPosition(0); do when the length is 0? Arguably, IndexOutOfBounds exception is logical and the special-casing for an empty buffer is only for an edge case in Java.


>>> pa.array([], pa.list_(pa.int64())).buffers()
[None, <pyarrow.lib.Buffer object at 0x7f8aaf2b8c30>, None, <pyarrow.lib.Buffer object at 0x7f8aaf2b8cb0>]
>>> pa.array([], pa.list_(pa.int64())).buffers()[1].to_pybytes()
b'\x00\x00\x00\x00' 
>>> rb = pa.record_batch([pa.array([], pa.list_(pa.int64()))], names=["a"])
>>> writer = pa.ipc.new_stream('foo.arrows', rb.schema)
>>> writer.write_batch(rb)
>>> writer.close()
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.VectorSchemaRoot;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;var allocator = new RootAllocator();
File file = new File("foo.arrows");
FileInputStream inputStream = new FileInputStream(file);
ArrowStreamReader reader = new ArrowStreamReader(inputStream, allocator);
reader.loadNextBatch()
reader.getVectorSchemaRoot().getVector("a").getReader()
var listReader = reader.getVectorSchemaRoot().getVector("a").getReader();
listReader.setPosition(0); 

fails with


|  Exception java.lang.IndexOutOfBoundsException: index: 4, length: 4 (expected: range(0, 4))
|        at ArrowBuf.checkIndexD (ArrowBuf.java:318)
|        at ArrowBuf.chk (ArrowBuf.java:305)
|        at ArrowBuf.getInt (ArrowBuf.java:424)
|        at UnionListReader.setPosition (UnionListReader.java:64)
|        at (#14:1) 

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
So maybe Java isn't wrong then (though, the exception message could be much better!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant