
Error "Not all children array length are the same!" when decoding rows spanning across page boundaries in parquet file when using RowSelection #9370

@jonded94

Describe the bug

I'm trying to read a parquet file containing a large number of image bytes in a column of MapType. Unfortunately, this leads to the error "Not all children array length are the same!", but only if I use a RowSelection! If I omit the RowSelection and let the file be read normally, my reproduction test succeeds.

To Reproduce

#[cfg(test)]
mod tests {
    use parquet::arrow::arrow_reader::{ArrowReaderBuilder, RowSelection, RowSelector};
    use std::fs::File;
    use std::path::PathBuf;

    #[test]
    fn validate_issue() {
        pub fn row_selection_from_indices(indices: &[usize]) -> RowSelection {
            let mut selectors = Vec::new();
            let mut last_end = 0;

            for &idx in indices {
                if idx > last_end {
                    selectors.push(RowSelector::skip(idx - last_end));
                }
                selectors.push(RowSelector::select(1));
                last_end = idx + 1;
            }

            selectors.into()
        }
        let indices = vec![352, 955];

        let arrow_reader = ArrowReaderBuilder::try_new(
            File::open(PathBuf::from(
                "issue_file.parquet",
            ))
            .unwrap(),
        )
        .unwrap();

        let mut batch_reader_builder = arrow_reader;
        batch_reader_builder = batch_reader_builder.with_row_groups(vec![99]);
        batch_reader_builder =
            batch_reader_builder.with_row_selection(row_selection_from_indices(indices.as_slice()));  // Removing this lets the test succeed again!

        let batch_reader = batch_reader_builder.build().unwrap();

        for item in batch_reader {
            item.unwrap();
        }    
    }
}

=> (the debug statements were added by me)

[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9] children_array_len = 3
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9] children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
    3,
    3,
    3,
    3,
    3,
    3,
    3,
]
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9] children_array_len = 3
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9] children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
    3,
    3,
]
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9] children_array_len = 2
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9] children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
    2,  // <-- Only the first array seems to be of length 2, all others have length 3
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
    3,
]
called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: Not all children array length are the same!")
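
To make the output above easier to read, here is a rough sketch of the kind of check that seems to be failing. It is a simplified stand-in and not the actual code in struct_array.rs: the function name `common_child_len` is hypothetical, and it takes plain child lengths instead of Arrow arrays. Only the error message is taken from the output above.

```rust
// Hypothetical, simplified stand-in for the struct reader's length check.
// It works on child lengths only; the real reader works on Arrow arrays.
fn common_child_len(child_lens: &[usize]) -> Result<usize, String> {
    // Use the first child's length as the reference length.
    let first = *child_lens
        .first()
        .ok_or_else(|| "struct reader has no children".to_string())?;

    // Every other child must report the same number of rows.
    if child_lens.iter().all(|&len| len == first) {
        Ok(first)
    } else {
        Err("Not all children array length are the same!".to_string())
    }
}
```

With the lengths from the last debug dump (one child of length 2, the rest of length 3), such a check returns the error, while the first two dumps (all 3) pass.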

Expected behavior
Iterating over the parquet file should work without problems, regardless of whether a RowSelection is used.

Additional context
Happens with arrow-rs 57.1.0, 57.2.0 and in this specific report I used commit fb77501.

I unfortunately can't give you the reproduction file, as it contains tons of confidential stuff, but I shared as much parquet-viewer output as possible. Most probably this is about the image_data map, specifically the image_bytes values column?

[Three images: parquet-viewer output for the file]
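
For context, the selection used in the reproduction can be expanded by hand without the parquet crate. The sketch below copies the logic of `row_selection_from_indices` but uses a hypothetical `Run` enum in place of `RowSelector`, and it assumes the indices are sorted and contain no duplicates.

```rust
// Hypothetical stand-in for parquet's `RowSelector`; not the real type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Run {
    Skip(usize),
    Select(usize),
}

// Same logic as `row_selection_from_indices` in the reproduction,
// assuming `indices` is sorted and free of duplicates.
fn runs_from_indices(indices: &[usize]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut last_end = 0;

    for &idx in indices {
        // Skip the gap between the previous selected row and this one.
        if idx > last_end {
            runs.push(Run::Skip(idx - last_end));
        }
        // Select exactly the row at `idx`.
        runs.push(Run::Select(1));
        last_end = idx + 1;
    }

    runs
}
```

For `[352, 955]` this gives skip 352, select 1, skip 602, select 1, so the reader is asked for two single rows within row group 99.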
