Closed
Description
Describe the bug
Originally reported in apache/datafusion#1441 and encountered again in #1110, ParquetFileArrowReader appears to read incorrect data for string columns that contain nulls.
In particular, the bug only appears when the column is nullable, actually contains nulls, and the file has multiple row groups.
To Reproduce
Read simple_strings.parquet.zip with the following code
#[test]
fn test_read_strings() {
    let testdata = arrow::util::test_util::parquet_test_data();
    let path = format!("{}/simple_strings.parquet", testdata);
    let parquet_file_reader =
        SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
    let record_batch_reader = arrow_reader
        .get_record_reader(60)
        .expect("Failed to read into array!");

    let batches = record_batch_reader
        .collect::<arrow::error::Result<Vec<_>>>()
        .unwrap();

    assert_eq!(batches.len(), 1);
    let batch = batches.into_iter().next().unwrap();
    assert_eq!(batch.num_rows(), 6);

    let strings = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .unwrap();

    let strings: Vec<_> = strings.iter().collect();
    assert_eq!(
        &strings,
        &[
            None,
            Some("-1685637712"),
            Some("512814980"),
            Some("868743207"),
            None,
            Some("-1001940778")
        ]
    )
}
Fails with
thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9
For comparison
$ python
> import duckdb
> duckdb.query("select * from 'simple_strings.parquet'").fetchall()
[(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]
The file consists of two row groups, each with 3 rows, and was generated using #1110.
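To make the divergence easier to see, here is a small std-only sketch that compares the `left` and `right` vectors from the panic message above and maps each mismatching row to its row group. The helper names (`mismatched_rows`, `observed`, `expected`) are made up for illustration and are not part of the parquet or arrow crates. The fixed 3 rows per row group comes from the file description above.

```rust
/// Returns `(row_index, row_group_index)` for every row where the two
/// slices disagree. Assumes every row group holds `rows_per_group` rows.
fn mismatched_rows(
    actual: &[Option<&str>],
    expected: &[Option<&str>],
    rows_per_group: usize,
) -> Vec<(usize, usize)> {
    actual
        .iter()
        .zip(expected.iter())
        .enumerate()
        .filter(|(_, (a, e))| a != e)
        .map(|(row, _)| (row, row / rows_per_group))
        .collect()
}

/// The `left` side of the failed assertion (what the reader returned).
fn observed() -> Vec<Option<&'static str>> {
    vec![
        None,
        Some("-1685637712"),
        Some("512814980"),
        Some("-1685637712"),
        None,
        Some("868743207"),
    ]
}

/// The `right` side of the failed assertion (matches the duckdb output).
fn expected() -> Vec<Option<&'static str>> {
    vec![
        None,
        Some("-1685637712"),
        Some("512814980"),
        Some("868743207"),
        None,
        Some("-1001940778"),
    ]
}
```

Calling `mismatched_rows(&observed(), &expected(), 3)` yields `[(3, 1), (5, 1)]`: the first row group (index 0) is read correctly, the null positions all line up, and only the non-null values in the second row group are wrong. That is consistent with the conditions listed in the description, though it does not by itself identify the root cause.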
Expected behavior
The test should pass