Closed
Describe the bug
ComplexObjectArrayReader does not use RecordReader and consequently does not correctly delimit semantic records when reading. In particular, it may yield a batch of values that ends part way through a row. This in turn causes the parent ListArrayReader to error out, because the repetition levels it receives are not consistent.
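To illustrate what "delimiting semantic records" means here, the following is a minimal sketch using only the standard library. It relies on the Parquet convention that a repetition level of 0 marks the start of a new record. The function name `complete_records` is hypothetical and is not part of the parquet crate; it is not a description of how RecordReader is actually implemented.

```rust
/// Returns `(levels_to_consume, records_found)` for the first `max_records`
/// complete records in `rep_levels`.
///
/// A record starts at every entry whose repetition level is 0, so a batch may
/// only end immediately before such an entry (or at the end of the data).
/// Hypothetical helper for illustration only.
fn complete_records(rep_levels: &[i16], max_records: usize) -> (usize, usize) {
    let mut records = 0;
    for (idx, &level) in rep_levels.iter().enumerate() {
        if level == 0 {
            if records == max_records {
                // `idx` begins a record we do not want: stop just before it.
                return (idx, records);
            }
            records += 1;
        }
    }
    // Ran out of data: everything seen so far belongs to the counted records.
    (rep_levels.len(), records)
}
```

For the levels `[0, 0, 0, 1, 0, 0, 0, 0, 1, 1]` and `max_records = 3`, this returns `(4, 3)`: the fourth entry (level 1) still belongs to the third record, so the batch has to include it. A reader that stops after a fixed number of entries instead would cut that record in two.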
To Reproduce
fn test_decimal_list() {
    let decimals = Decimal128Array::from_iter_values([1, 2, 3, 4, 5, 6, 7, 8]);
    // [[], [1], [2, 3], null, [4], null, [6, 7, 8]]
    let data = ArrayDataBuilder::new(ArrowDataType::List(Box::new(Field::new(
        "item",
        decimals.data_type().clone(),
        false,
    ))))
    .len(7)
    .add_buffer(Buffer::from_iter([0_i32, 0, 1, 3, 3, 4, 5, 8]))
    // Validity bitmap, least significant bit first: lists 3 and 5 are null
    .null_bit_buffer(Some(Buffer::from(&[0b01010111])))
    .child_data(vec![decimals.into_data()])
    .build()
    .unwrap();
    let written = RecordBatch::try_from_iter([(
        "list",
        Arc::new(ListArray::from(data)) as ArrayRef,
    )])
    .unwrap();
    let mut buffer = Vec::with_capacity(1024);
    let mut writer =
        ArrowWriter::try_new(&mut buffer, written.schema(), None).unwrap();
    writer.write(&written).unwrap();
    writer.close().unwrap();
    // A batch size of 3 should split the 7 rows into batches of 3, 3 and 1
    let read = ParquetFileArrowReader::try_new(Bytes::from(buffer))
        .unwrap()
        .get_record_reader(3)
        .unwrap()
        .collect::<ArrowResult<Vec<_>>>()
        .unwrap();
    assert_eq!(&written.slice(0, 3), &read[0]);
    assert_eq!(&written.slice(3, 3), &read[1]);
    assert_eq!(&written.slice(6, 1), &read[2]);
}
Results in
ParquetError("Parquet error: first repetition level of batch must be 0")
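The error can be traced by hand for the list in the reproduction. The sketch below derives the repetition levels from the same offsets and validity bits, then cuts them into batches of a fixed number of level entries. That fixed-size cut is an assumption made for illustration and is not necessarily the exact behaviour of ComplexObjectArrayReader. The names `list_rep_levels` and `first_bad_batch` are hypothetical.

```rust
/// Repetition levels for a single-level list column.
/// Null and empty lists still contribute one entry (at level 0).
/// Hypothetical helper for illustration only.
fn list_rep_levels(offsets: &[i32], valid: &[bool]) -> Vec<i16> {
    let mut levels = Vec::new();
    for (window, &is_valid) in offsets.windows(2).zip(valid) {
        let len = if is_valid {
            (window[1] - window[0]) as usize
        } else {
            0
        };
        // The first entry of every list starts a new record.
        levels.push(0);
        // Every further element of the same list continues that record.
        levels.extend(std::iter::repeat(1).take(len.saturating_sub(1)));
    }
    levels
}

/// Index of the first batch that does not start a new record, if any.
/// Assumes batches are cut after `batch_size` level entries (must be > 0).
fn first_bad_batch(rep_levels: &[i16], batch_size: usize) -> Option<usize> {
    rep_levels
        .chunks(batch_size)
        .position(|batch| batch[0] != 0)
}
```

With offsets `[0, 0, 1, 3, 3, 4, 5, 8]` and lists 3 and 5 marked null, `list_rep_levels` yields `[0, 0, 0, 1, 0, 0, 0, 0, 1, 1]`. Cutting that into chunks of 3 gives `[0, 0, 0]` followed by `[1, 0, 0]`, so `first_bad_batch` returns `Some(1)`: the second batch begins in the middle of `[2, 3]`, which is the situation the error message describes.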
Expected behavior
We should support reading these nested types.
Additional context
#1661 tracks removing this ArrayReader, as it is buggy, complex, and no longer really needed.