
max_statistics_truncate_length is ignored when writing statistics to data page headers #7579

Closed
@alamb


Describe the bug

When writing long string values into string columns in Parquet, we expect WriterProperties::max_statistics_truncate_length to be applied and reduce the size of the statistics.

This property currently correctly truncates statistics written to the ColumnChunkMetadata but NOT the statistics written to the data page headers.

To Reproduce

use std::io::BufWriter;
use std::sync::Arc;
use arrow::array::{ArrayRef, RecordBatch, StringViewArray};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() {
    let output = std::fs::File::create("output.parquet").unwrap();
    let mut output = BufWriter::new(output);

    let batch = make_batch('a');
    // One row per row group; request that statistics be truncated to 64 bytes.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1)
        .set_statistics_truncate_length(Some(64))
        .build();

    let mut writer = ArrowWriter::try_new(&mut output, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();

    // Write one more batch (and so one more row group) per character.
    for c in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] {
        let batch = make_batch(c);
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();
}

// Makes a batch with long string values for testing purposes.
fn make_batch(val: char) -> RecordBatch {
    let col = Arc::new(StringViewArray::from_iter_values(
        [val.to_string().repeat(100000)]
    )) as ArrayRef;
    RecordBatch::try_from_iter([("col", col)]).unwrap()
}

The resulting data page headers contain statistics that are not truncated: the min/max values are the full 100,000-byte strings.

Expected behavior
I expect the statistics in the data page headers to be truncated to 64 bytes, as they already are in the ColumnChunkMetadata.
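
To make the expectation concrete, here is a minimal stdlib-only sketch of what truncating a min/max statistic to a byte limit could look like. The helper names `truncate_min` and `truncate_max` are hypothetical and are not the parquet crate's API or its actual implementation; the sketch only assumes that a truncated min must stay a lower bound and a truncated max must stay an upper bound.

```rust
/// Hypothetical helper: truncate `s` to at most `max_len` bytes, backing up
/// to a UTF-8 character boundary. The result is a prefix of `s`, so it
/// compares <= `s` and is still a valid lower bound (min statistic).
fn truncate_min(s: &str, max_len: usize) -> &str {
    if s.len() <= max_len {
        return s;
    }
    let mut end = max_len;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Hypothetical helper: truncate `s` and then increment the last byte so the
/// result compares > `s` and is still a valid upper bound (max statistic).
/// This works on raw bytes, so the output may not be valid UTF-8; a real
/// implementation for string columns would likely need to handle that.
/// Returns None when no upper bound within `max_len` bytes can be built.
fn truncate_max(s: &str, max_len: usize) -> Option<Vec<u8>> {
    if s.len() <= max_len {
        return Some(s.as_bytes().to_vec());
    }
    let mut bytes = truncate_min(s, max_len).as_bytes().to_vec();
    // Drop trailing 0xFF bytes (which cannot be incremented), then bump
    // the first byte from the end that can be.
    while let Some(last) = bytes.pop() {
        if last < u8::MAX {
            bytes.push(last + 1);
            return Some(bytes);
        }
    }
    None
}
```

Under these assumptions, for the 100,000-character string of 'a' from the reproducer and a limit of 64, `truncate_min` yields 64 'a' bytes and `truncate_max` yields 63 'a' bytes followed by 'b'. Both fit in 64 bytes and still bound the original value, which is the size the data page header statistics would be expected to have.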

Additional context

Labels: bug, enhancement, next-major-release, parquet
