Description
Describe the bug
I'm trying to read a parquet file that contains a large number of image bytes in a column of MapType. Reading fails with the error "Not all children array length are the same!", but only when I use a RowSelection. If I omit the RowSelection and let the reader traverse the file normally, my reproduction test succeeds.
To Reproduce
#[cfg(test)]
mod tests {
    use parquet::arrow::arrow_reader::{ArrowReaderBuilder, RowSelection, RowSelector};
    use std::fs::File;
    use std::path::PathBuf;

    #[test]
    fn validate_issue() {
        // Builds a selection that picks exactly one row per (sorted) index
        // and skips the gaps in between.
        pub fn row_selection_from_indices(indices: &[usize]) -> RowSelection {
            let mut selectors = Vec::new();
            let mut last_end = 0;
            for &idx in indices {
                if idx > last_end {
                    selectors.push(RowSelector::skip(idx - last_end));
                }
                selectors.push(RowSelector::select(1));
                last_end = idx + 1;
            }
            selectors.into()
        }

        let indices = vec![352, 955];
        let arrow_reader = ArrowReaderBuilder::try_new(
            File::open(PathBuf::from(
                "issue_file.parquet",
            ))
            .unwrap(),
        )
        .unwrap();
        let mut batch_reader_builder = arrow_reader;
        batch_reader_builder = batch_reader_builder.with_row_groups(vec![99]);
        batch_reader_builder =
            batch_reader_builder.with_row_selection(row_selection_from_indices(indices.as_slice())); // Removing this lets the test succeed again!
        let batch_reader = batch_reader_builder.build().unwrap();
        for item in batch_reader {
            item.unwrap();
        }
    }
}
=> (the debug statements were added by me)
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9] children_array_len = 3
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9] children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
3,
3,
3,
3,
3,
3,
3,
]
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9] children_array_len = 3
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9] children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
3,
3,
]
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9] children_array_len = 2
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9] children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
2, // <-- Only the first array seems to be of length 2, all others have length 3
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
]
called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: Not all children array length are the same!")
Expected behavior
Iterating through the parquet file should work without problems, regardless of whether a RowSelection is used.
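To make the expectation concrete, here is a minimal sketch of what the helper in the reproduction produces for the indices `[352, 955]`. It does not use the parquet crate: `Selector`, `selectors_from_indices` and `selected_row_count` are hypothetical stand-ins introduced only for illustration, mirroring `RowSelector::skip` / `RowSelector::select` and the loop in `row_selection_from_indices`. It assumes the indices are sorted in ascending order without duplicates.

```rust
/// Local stand-in for `parquet::arrow::arrow_reader::RowSelector`
/// (an assumption for illustration, so the sketch compiles on its own).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Selector {
    Skip(usize),
    Select(usize),
}

/// Same logic as `row_selection_from_indices` in the reproduction:
/// skip the gap before each index, then select exactly one row.
/// Assumes `indices` is sorted ascending and has no duplicates.
fn selectors_from_indices(indices: &[usize]) -> Vec<Selector> {
    let mut selectors = Vec::new();
    let mut last_end = 0;
    for &idx in indices {
        if idx > last_end {
            selectors.push(Selector::Skip(idx - last_end));
        }
        selectors.push(Selector::Select(1));
        last_end = idx + 1;
    }
    selectors
}

/// Total number of rows the selection asks the reader to return.
fn selected_row_count(selectors: &[Selector]) -> usize {
    selectors
        .iter()
        .map(|s| match s {
            Selector::Select(n) => *n,
            Selector::Skip(_) => 0,
        })
        .sum()
}

fn main() {
    let selectors = selectors_from_indices(&[352, 955]);
    // Prints: [Skip(352), Select(1), Skip(602), Select(1)]
    println!("{:?}", selectors);
    // Prints: selected rows: 2
    println!("selected rows: {}", selected_row_count(&selectors));
}
```

So the selection asks for exactly two rows out of row group 99, separated by a skip of 602 rows.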
Additional context
Happens with arrow-rs 57.1.0 and 57.2.0; for this specific report I used commit fb77501.
I unfortunately can't give you the reproduction file, as it contains tons of confidential stuff, but I shared as much parquet-viewer output as possible. Most probably this is about the image_data map, specifically the image_bytes values column?
