Describe the bug
- As @jonded94 found in "Files containing binary data with >=8_388_855 bytes per row written with arrow-rs can't be read with pyarrow" (#7489)
- And @etseidl debugged in "Truncate Parquet page data page statistics" (#7555)
When writing long string values into string columns in Parquet, we expect `WriterProperties::statistics_truncate_length` to apply and reduce their size.
This property currently truncates the statistics written to the `ColumnChunkMetaData` correctly, but NOT the statistics written to the data page headers.
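For contrast, the truncated column chunk statistics can be confirmed by reading the file's metadata back. A minimal sketch (not part of the original report), assuming the `output.parquet` file produced by the reproducer below:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> parquet::errors::Result<()> {
    let reader = SerializedFileReader::new(File::open("output.parquet")?)?;
    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        if let Some(stats) = rg.column(0).statistics() {
            // These min/max values come out truncated to <= 64 bytes, as configured
            println!("row group {i}: column chunk stats {stats:?}");
        }
    }
    Ok(())
}
```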
To Reproduce
```rust
use std::io::BufWriter;
use std::sync::Arc;

use arrow::array::{ArrayRef, RecordBatch, StringViewArray};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() {
    let output = std::fs::File::create("output.parquet").unwrap();
    let mut output = BufWriter::new(output);

    let batch = make_batch('a');
    let props = WriterProperties::builder()
        .set_max_row_group_size(1)
        .set_statistics_truncate_length(Some(64))
        .build();
    let mut writer = ArrowWriter::try_new(&mut output, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    for char in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] {
        let batch = make_batch(char);
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();
}

/// Makes a single-row batch whose only column holds one long (100_000 byte) string value.
fn make_batch(val: char) -> RecordBatch {
    let col = Arc::new(StringViewArray::from_iter_values([
        val.to_string().repeat(100000),
    ])) as ArrayRef;
    RecordBatch::try_from_iter([("col", col)]).unwrap()
}
```
The resulting data page headers contain untruncated statistics (the full 100,000-byte values), even though the column chunk statistics are truncated to 64 bytes as requested.
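One way to observe this is to walk the pages directly with the low-level page reader. A sketch (not from the original report), assuming the parquet crate's `SerializedPageReader` and `Page::statistics` APIs and the same `output.parquet`:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::column::page::PageReader;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::serialized_reader::SerializedPageReader;

fn main() -> parquet::errors::Result<()> {
    let reader = SerializedFileReader::new(File::open("output.parquet")?)?;
    let file = Arc::new(File::open("output.parquet")?);

    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        let mut pages = SerializedPageReader::new(
            Arc::clone(&file),
            rg.column(0),
            rg.num_rows() as usize,
            None,
        )?;
        while let Some(page) = pages.get_next_page()? {
            if let Some(stats) = page.statistics() {
                // Observed: the full ~100_000 byte min/max values, not 64-byte prefixes
                println!("row group {i}: page header stats {stats:?}");
            }
        }
    }
    Ok(())
}
```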
Expected behavior
I expect the statistics in the data page headers to be truncated to 64 bytes, the same way the column chunk statistics are.
Additional context