Truncate Parquet page data page statistics #7555
Leaving as draft until tests can be added. |
I think this functionality already exists under a different option? https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_column_index_truncate_length
AFAICT that option is solely for the min and max in the column index. But I will verify that. This PR is for the statistics embedded in the page header. I believe it was these stats blowing up the arrow-cpp reader in the linked issue. Edit: verified that the added test fails if |
Aah yes sorry, read things too quickly, carry on |
I am doing some tests of this PR but I am likely to run out of time today |
Thank you @etseidl -- I think this PR is going to substantially improve the metadata size of anyone writing large strings
TIL there are actually statistics in the DataPages, which are NOT the same as the statistics stored in the ColumnIndex NOR the same as the "Page Index" (sorry was behind on this)
I tested this PR writing some large data (program below) and setting statistics truncation to 64.
Before this PR the file was 4MB:
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/test_page_stats$ du -s -h output.parquet
4.0M output.parquet
After this PR the same program wrote 2MB:
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/test_page_stats$ du -s -h output.parquet
2.0M output.parquet
It took me a while to realize that there is another potential copy of the statistics in the data page header (different from what is in the ColumnChunk metadata). arrow-rs writes these statistics but doesn't have an API to read them.
I am struggling to figure out where these page statistics are written!
Those seem to be correctly truncated (to 64 bytes)
Test Program
use std::io::BufWriter;
use std::sync::Arc;

use arrow::array::{ArrayRef, RecordBatch, StringViewArray};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() {
    let output = std::fs::File::create("output.parquet").unwrap();
    let mut output = BufWriter::new(output);

    let batch = make_batch('a');
    // One row per row group, with statistics truncated to 64 bytes.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1)
        .set_statistics_truncate_length(Some(64))
        .build();

    let mut writer = ArrowWriter::try_new(&mut output, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();

    for char in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] {
        let batch = make_batch(char);
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();
}

// Makes a batch with long string values for testing purposes.
fn make_batch(val: char) -> RecordBatch {
    let col = Arc::new(StringViewArray::from_iter_values(
        [val.to_string().repeat(100000)]
    )) as ArrayRef;
    RecordBatch::try_from_iter([("col", col)]).unwrap()
}
BTW I am pretty sure @adriangb was showing me some data the other day with relatively large metadata. I think this PR will substantially reduce the size of the metadata for his files |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
BTW I also filed a ticket to discuss a way to disable writing the redundant copy of statistics to the data page header statistics entirely: |
I took the liberty of merging up from main to resolve some conflicts and tweaked the docs a bit |
🚀 |
Which issue does this PR close?
Enables workaround for #7489
max_statistics_truncate_length is ignored when writing statistics to data page headers #7579
Rationale for this change
When WriterProperties::statistics_truncate_length is set, the column chunk statistics are truncated, but the page statistics are not. This can lead to very large page headers that blow up some readers.
What changes are included in this PR?
Data Page Header statistics are now truncated as well.
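For readers wondering why truncating a max value takes more than slicing off a prefix, here is a minimal std-only sketch of byte-wise min/max truncation. The helper names `truncate_min` and `truncate_max` are hypothetical and this is not the code from this PR. It only illustrates the idea: a truncated min can be a plain prefix, while a truncated max needs its last byte incremented so that it still bounds the original value. The sketch works on raw bytes and ignores UTF-8 character boundaries, which a real writer would have to respect for string columns.

```rust
/// Hypothetical helper: truncate a lower bound to at most `len` bytes.
/// A prefix of a byte string never sorts above the original, so the
/// prefix is still a valid minimum.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: truncate an upper bound to at most `len` bytes.
/// A bare prefix sorts below the original, so the prefix is incremented
/// to keep it greater than or equal to the original value.
/// Returns `None` when no shorter bound exists (every kept byte is 0xFF).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then bump
    // the last remaining byte.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some(prefix);
        }
    }
    None
}

fn main() {
    // Same shape of data as the test program above: one very long string.
    let value = "b".repeat(100_000);
    let min = truncate_min(value.as_bytes(), 64);
    let max = truncate_max(value.as_bytes(), 64).expect("a shorter max exists");

    // Both bounds are short, and they still bracket the original value.
    assert_eq!(min.len(), 64);
    assert_eq!(max.len(), 64);
    assert!(min.as_slice() <= value.as_bytes());
    assert!(max.as_slice() >= value.as_bytes());
    println!("min and max are {} and {} bytes", min.len(), max.len());
}
```

With a 64-byte limit, each page header in this sketch would carry about 128 bytes of min/max instead of two full copies of the 100,000-byte value.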
Are there any user-facing changes?
No