Skip to content

Stop writing statistics to Parquet page headers by default, add option to enable #7594

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jun 2, 2025

Which issue does this PR close?

Rationale for this change

Reduces metadata bloat by not writing redundant statistics to each Parquet page header.

What changes are included in this PR?

Changes the meaning of EnabledStatistics::Page. Currently this level means statistics will be written to the column chunk, page header, and column index. With this PR Page now means writing to the column chunk and column index. Writing to the page header can be enabled using an added write_page_header_statistics writer option.

Also adds some command line switches to the parquet-rewrite tool.

Are there any user-facing changes?

No breaking API changes, but an added option and behavior change.

Specifically, statistics are no longer written to data Page headers by default. If you want them (not common) you will have to explictly enable doing so

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 2, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Jun 2, 2025

Setting this to draft for now since it depends on #7555.

@etseidl etseidl added the enhancement Any new improvement worthy of a entry in the changelog label Jun 2, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- I went over this PR carefully and I think it looks really nice.

I couldn't help myself so I also merged up from main to resolve a conflict, added an explicit test for writing page statistics, and tweaked some docs.

I think this is going to make a non trivial difference for people who store data using parquet-rs and enable page level statistics. THANK YOU

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label Jun 9, 2025
@alamb
Copy link
Contributor

alamb commented Jun 9, 2025

I think we should wait for the next major release before merging this PR

@alamb alamb added the api-change Changes to the arrow API label Jun 9, 2025
@alamb alamb changed the title Add option to control writing of statistics to Parquet page headers Stop writing statistics to Parquet page headers by default, add option to enable Jun 9, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Jun 9, 2025

I couldn't help myself so I also merged up from main to resolve a conflict, added an explicit test for writing page statistics, and tweaked some docs.

🤣 Thanks @alamb! Changes look great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some way to avoid writing redundant statistics into data page headers
2 participants