ArrowArrayReader Reads Too Many Values From Bit-Packed Runs #1111

@tustvold

Describe the bug

Originally reported in apache/datafusion#1441 and encountered again in #1110, ParquetFileArrowReader appears to read incorrect data for string columns that contain nulls.

In particular, the bug requires that the column is nullable, that it contains nulls, and that the file has multiple row groups.

To Reproduce

Read simple_strings.parquet.zip (unzipped into the parquet test data directory) with the following code:

    #[test]
    fn test_read_strings() {
        let testdata = arrow::util::test_util::parquet_test_data();
        let path = format!("{}/simple_strings.parquet", testdata);
        let parquet_file_reader =
            SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
        let record_batch_reader = arrow_reader
            .get_record_reader(60)
            .expect("Failed to read into array!");

        let batches = record_batch_reader
            .collect::<arrow::error::Result<Vec<_>>>()
            .unwrap();

        assert_eq!(batches.len(), 1);
        let batch = batches.into_iter().next().unwrap();
        assert_eq!(batch.num_rows(), 6);

        let strings = batch
            .column(0)
            .as_any()
            .downcast_ref::<StringArray>()
            .unwrap();

        let strings: Vec<_> = strings.iter().collect();

        assert_eq!(
            &strings,
            &[
                None,
                Some("-1685637712"),
                Some("512814980"),
                Some("868743207"),
                None,
                Some("-1001940778")
            ]
        )
    }

Fails with

thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
  left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
 right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9

For comparison, DuckDB reads the expected values from the same file:

$ python
>>> import duckdb
>>> duckdb.query("select * from 'simple_strings.parquet'").fetchall()
[(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]

The file consists of two row groups, each with 3 rows, and was generated using #1110.
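
As background for the title, and not a description of the actual reader code: in Parquet's RLE/bit-packed hybrid encoding, which is used for definition levels, a bit-packed run always stores a multiple of 8 values. A row group with 3 levels therefore carries 5 padding values, and a decoder has to stop at the page's value count rather than at the end of the run. The Rust sketch below illustrates this with a hypothetical helper, decode_bit_packed_run, assuming 1-bit levels and a single-byte run header. It is not the parquet crate's API.

```rust
/// Hypothetical helper: decode one bit-packed run of 1-bit levels.
/// Returns every value the run physically stores, padding included.
fn decode_bit_packed_run(data: &[u8]) -> Vec<u8> {
    // The run header is a varint; this sketch assumes it fits in one byte.
    // Low bit 1 marks a bit-packed run, the remaining bits count 8-value groups.
    let header = data[0];
    assert_eq!(header & 1, 1, "not a bit-packed run");
    let groups = (header >> 1) as usize;

    let mut out = Vec::with_capacity(groups * 8);
    // With a bit width of 1, each group of 8 values occupies exactly one byte.
    for &byte in &data[1..1 + groups] {
        // Values are packed starting from the least significant bit.
        for bit in 0..8u32 {
            out.push((byte >> bit) & 1);
        }
    }
    out
}

fn main() {
    // Made-up page for illustration: 3 definition levels [0, 1, 1],
    // i.e. null, value, value.
    // Header 0x03 = (1 group << 1) | 1, payload 0b0000_0110.
    let run = [0x03u8, 0b0000_0110];
    let num_values = 3; // value count the page header would report

    let all = decode_bit_packed_run(&run);
    assert_eq!(all.len(), 8); // the run stores 8 values, 5 of them padding

    // Only the first `num_values` levels belong to the page.
    let levels = &all[..num_values];
    assert_eq!(levels, &[0u8, 1, 1]);
    println!("stored: {}, valid: {:?}", all.len(), levels);
}
```

A reader that treats all 8 decoded levels as real would lose track of where the next row group's levels and values begin, which is one plausible way to end up with shifted or repeated strings like those in the failure above.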

Expected behavior

The test should pass.
