Truncate Parquet page data page statistics #7555

Merged
7 commits merged on Jun 3, 2025

Conversation

etseidl
Contributor

@etseidl etseidl commented May 28, 2025

Which issue does this PR close?

Enables workaround for #7489

Rationale for this change

When WriterProperties::statistics_truncate_length is set, the column chunk statistics are truncated, but the page statistics are not. This can lead to very large page headers that blow up some readers.

What changes are included in this PR?

Data Page Header statistics are now truncated as well.
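
To make the truncation concrete, here is a minimal sketch of how min and max byte-array statistics can be shortened while remaining valid bounds. The helper names `truncate_min` and `truncate_max` are hypothetical and this is not the code from this PR; it only illustrates the general idea. A prefix of the minimum is still a lower bound, while a prefix of the maximum has to be incremented to remain an upper bound. The sketch works on raw bytes; a string column would additionally need to cut at a UTF-8 character boundary.

```rust
/// Hypothetical helper: shorten a lower bound to at most `len` bytes.
/// Any prefix of a byte string sorts at or before the full string,
/// so the prefix is still a valid lower bound.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    let end = value.len().min(len);
    value[..end].to_vec()
}

/// Hypothetical helper: shorten an upper bound to at most `len` bytes.
/// A plain prefix would sort before the original, so the last byte that
/// is not 0xFF is incremented and any trailing 0xFF bytes are dropped.
/// Returns `None` when no shorter upper bound exists (for example, when
/// the prefix consists only of 0xFF bytes); a writer would then have to
/// keep the full value.
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Already short enough, nothing to do.
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    while let Some(last) = prefix.pop() {
        if last != 0xFF {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}
```

With a limit of 2 bytes, for example, a max of `aaaaaa` becomes `ab`, which still sorts after the original value, and a min of `aaaaaa` becomes `aa`.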

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label May 28, 2025
@etseidl
Contributor Author

etseidl commented May 28, 2025

Leaving as draft until tests can be added.

@etseidl etseidl marked this pull request as ready for review May 28, 2025 22:28
@tustvold
Contributor

@etseidl
Contributor Author

etseidl commented May 29, 2025

> This functionality I think already exists under a different option? - https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_column_index_truncate_length

AFAICT that option is solely for the min and max in the column index. But I will verify that. This PR is for the statistics embedded in the page header. I believe it was these stats blowing up the arrow-cpp reader in the linked issue.

Edit: verified that the added test fails if column_index_truncate_length is 2 and statistics_truncate_length is None.
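
The distinction discussed here can be summarized in a small model. It is only an illustration based on what is stated in this thread: the enum `StatsLocation`, the struct `TruncationSettings`, and the function `effective_limit` are hypothetical names, not parquet crate APIs. Only the two field names mirror the `WriterProperties` options mentioned above.

```rust
/// Hypothetical: places where a copy of min/max statistics can be stored.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StatsLocation {
    ColumnChunkMetadata,
    DataPageHeader,
    ColumnIndex,
}

/// Hypothetical stand-in for the two truncation options on `WriterProperties`.
#[derive(Debug, Clone, Copy)]
struct TruncationSettings {
    statistics_truncate_length: Option<usize>,
    column_index_truncate_length: Option<usize>,
}

/// Returns the byte limit applied to min/max at `location`,
/// or `None` if the values are written untruncated.
/// `page_header_fix` toggles the behavior added by this PR.
fn effective_limit(
    settings: TruncationSettings,
    location: StatsLocation,
    page_header_fix: bool,
) -> Option<usize> {
    match location {
        // Column chunk statistics already honored this option.
        StatsLocation::ColumnChunkMetadata => settings.statistics_truncate_length,
        // The column index has its own, separate option.
        StatsLocation::ColumnIndex => settings.column_index_truncate_length,
        // With the fix, page header statistics follow the same option
        // as the column chunk statistics.
        StatsLocation::DataPageHeader if page_header_fix => {
            settings.statistics_truncate_length
        }
        // Without the fix, page header statistics were never truncated.
        StatsLocation::DataPageHeader => None,
    }
}
```

In this model, setting `column_index_truncate_length` to `Some(2)` while leaving `statistics_truncate_length` as `None` leaves the page header statistics untruncated, which matches the failing-test check described above.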

@tustvold
Contributor

Aah yes sorry, read things too quickly, carry on

@alamb
Contributor

alamb commented May 30, 2025

I am doing some tests of this PR but I am likely to run out of time today

Contributor

@alamb alamb left a comment

Thank you @etseidl -- I think this PR is going to substantially reduce the metadata size for anyone writing large strings

TIL there are actually statistics in the DataPages, which are NOT the same as the statistics stored in the ColumnIndex NOR the same as the "Page Index" (sorry was behind on this)

I tested this PR writing some large data (program below) and setting statistics truncation to 64.

Before this PR the file was 4MB:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/test_page_stats$ du -s -h output.parquet
4.0M	output.parquet

After this PR the same program wrote 2MB:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/test_page_stats$ du -s -h output.parquet
2.0M	output.parquet

It took me a while to realize that there was another potential copy of statistics in the data page header (which is different from what is in the ColumnChunk metadata). arrow-rs writes these statistics but doesn't have an API to read them.

I am struggling to figure out where these page statistics are written!

Those seem to be correctly truncated (to 64 bytes)

Test Program

use std::io::BufWriter;
use std::sync::Arc;
use arrow::array::{ArrayRef, RecordBatch, StringViewArray};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() {

    let output= std::fs::File::create("output.parquet").unwrap();
    let mut output = BufWriter::new(output);

    let batch = make_batch('a');
    let props = WriterProperties::builder()
        .set_max_row_group_size(1)
        .set_statistics_truncate_length(Some(64))
        .build();

    let mut writer = ArrowWriter::try_new(&mut output, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();

    // Write ten more single-row batches, each holding one 100,000-character string.
    for c in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] {
        let batch = make_batch(c);
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();
}

// Makes a batch with long string values for testing purposes.
fn make_batch(val: char) -> RecordBatch {
    let col = Arc::new(StringViewArray::from_iter_values(
        [val.to_string().repeat(100000)]
    )) as ArrayRef;
    RecordBatch::try_from_iter([("col", col)]).unwrap()
}

@alamb
Contributor

alamb commented May 31, 2025

BTW I am pretty sure @adriangb was showing me some data the other day with relatively large metadata. I think this PR will substantially reduce the size of the metadata for his files

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb alamb changed the title from "Truncate Parquet page statistics" to "Truncate Parquet page data page statistics" May 31, 2025
@alamb
Contributor

alamb commented May 31, 2025

BTW I also filed a ticket to discuss a way to disable writing the redundant copy of statistics to the data page header entirely:

@alamb
Contributor

alamb commented Jun 2, 2025

I took the liberty of merging up from main to resolve some conflicts and tweaked the docs a bit

@alamb alamb merged commit 0ae9f66 into apache:main Jun 3, 2025
16 checks passed
@alamb
Contributor

alamb commented Jun 3, 2025

🚀

@etseidl etseidl deleted the truncate_page_stats branch June 3, 2025 16:03
@alamb alamb added the enhancement (Any new improvement worthy of a entry in the changelog) label Jun 20, 2025
@alamb
Contributor

alamb commented Jun 20, 2025

label_issue.py automatically added labels {'enhancement'} from #7594

@alamb alamb added the next-major-release (the PR has API changes and it waiting on the next major version) label Jun 20, 2025
@alamb
Contributor

alamb commented Jun 20, 2025

label_issue.py automatically added labels {'next-major-release'} from #7594

Labels
bug
enhancement: Any new improvement worthy of a entry in the changelog
next-major-release: the PR has API changes and it waiting on the next major version
parquet: Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

max_statistics_truncate_length is ignored when writing statistics to data page headers
3 participants