Report uncompressed column size as a statistic #6848
Conversation
Could you add tests for this to Otherwise looks reasonable to me.
Looks like Date64 handling has changed since you branched, which is throwing off But thanks! That was fast turnaround.
Thank you for the fast review!
@@ -1432,6 +1432,24 @@ impl<'a> StatisticsConverter<'a> {
         Ok(UInt64Array::from_iter(null_counts))
     }

     /// Extract the uncompressed sizes from row group statistics in [`RowGroupMetaData`]
It might also be worth mentioning here that this is the uncompressed size of the Parquet data page.
Aka this is what is reported here
I think that, as written, it might be confused with the uncompressed size after decoding to Arrow, which will likely be quite different (and substantially larger).
Good point. Which is why I wanted this added to Parquet, to allow better estimation of decoded sizes.
🤦 I forgot that you added this:
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20unencoded_byte_array_data_bytes&type=code
So @AdamGS, what do you think about updating this PR to return the unencoded_byte_array_data_bytes field instead of the decompressed page size?
SGTM, I'll probably have it tomorrow/later today depending on my jetlag
Bear in mind that unencoded_byte_array_data_bytes is only for byte array data, and does not include any overhead introduced by Arrow (offsets array, etc.). For fixed-width types it would be sufficient to know the total number of values encoded.
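To make the Arrow overhead mentioned above concrete, here is a minimal sketch of how a consumer might turn unencoded_byte_array_data_bytes into a rough Arrow-side estimate. The function name and its parameters are hypothetical and are not part of the parquet or arrow crates. It assumes 32-bit offsets (as in Arrow's Utf8 and Binary layouts) and ignores buffer padding and alignment, so it is closer to a lower bound than an exact figure.

```rust
/// Rough estimate of the Arrow memory needed for a variable-length array
/// (hypothetical helper, assuming i32 offsets and no buffer padding).
fn estimated_arrow_byte_array_size(
    unencoded_data_bytes: u64,
    num_rows: u64,
    has_nulls: bool,
) -> u64 {
    // Offsets buffer: one 4-byte offset per row plus one trailing offset.
    let offsets = (num_rows + 1) * 4;
    // Validity bitmap: one bit per row, only needed when nulls are present.
    let validity = if has_nulls { (num_rows + 7) / 8 } else { 0 };
    unencoded_data_bytes + offsets + validity
}
```

For example, 100 rows holding 1000 bytes of string data would need the 1000 data bytes plus 404 bytes of offsets under these assumptions.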
My current plan is to report unencoded_byte_array_data_bytes for BYTE_ARRAY columns, and width * num_values for the others, which in my mind is the amount of "information stored".
Consumers like DataFusion can then add any known overheads (like Arrow offset arrays etc.).
The other option I can think of is reporting the value size, and letting callers do any arithmetic they find useful (like multiplying by the number of values etc.). Would love to hear your thoughts.
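The plan described above could be sketched roughly as follows. This is an illustration only: the PhysicalType enum and the information_stored_bytes function are hypothetical stand-ins written for this example, not the parquet crate's types or the code in this PR. The widths follow the Parquet physical types, BOOLEAN is counted as one bit per value (an assumption), and the byte array figure is optional because the field is optional in the Parquet metadata.

```rust
/// Hypothetical stand-in for Parquet's physical types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Boolean,
    Int32,
    Int64,
    Int96,
    Float,
    Double,
    ByteArray,
    /// Fixed-length byte array with the given length in bytes.
    FixedLenByteArray(u32),
}

/// "Information stored" for one column chunk, in bytes.
/// Returns None for BYTE_ARRAY columns whose writer did not record
/// unencoded_byte_array_data_bytes.
fn information_stored_bytes(
    physical_type: PhysicalType,
    num_values: u64,
    unencoded_byte_array_data_bytes: Option<u64>,
) -> Option<u64> {
    match physical_type {
        // Variable-length data: use the recorded statistic as-is.
        PhysicalType::ByteArray => unencoded_byte_array_data_bytes,
        // Assumption: booleans are counted as one bit each, rounded up.
        PhysicalType::Boolean => Some((num_values + 7) / 8),
        // Fixed-width types: width * num_values.
        PhysicalType::Int32 | PhysicalType::Float => Some(4 * num_values),
        PhysicalType::Int64 | PhysicalType::Double => Some(8 * num_values),
        PhysicalType::Int96 => Some(12 * num_values),
        PhysicalType::FixedLenByteArray(len) => Some(u64::from(len) * num_values),
    }
}
```

A consumer such as DataFusion would then add its own overheads (offsets, validity bitmaps) on top of the returned value.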
In my opinion:
- the parquet crate's API already exposes the unencoded_byte_array_data_bytes metric, so users can do whatever arithmetic they want; simply adding unencoded_byte_array_data_bytes to the StatisticsConverter is not very helpful (if anything, returning unencoded_byte_array_data_bytes as an arrow array makes the values harder to use)
- Something I do think could potentially be valuable is some way to calculate the memory required for certain amounts of arrow data (e.g. a 100 row Int64 array), but that is probably worth its own ticket / discussion
I suggest proceeding with apache/datafusion#7548 by adding code there first / figuring out the real use case and then upstreaming (to arrow-rs) any common pattern that emerges
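As a rough illustration of the kind of calculation meant by "the memory required for certain amounts of arrow data", the sketch below estimates the size of a fixed-width Arrow array from its row count. The function is hypothetical and does not exist in arrow-rs. It counts only the values buffer and an optional validity bitmap, and it ignores the padding and alignment that real Arrow buffers may add.

```rust
use std::mem::size_of;

/// Rough estimate of the memory for a fixed-width Arrow array whose
/// native value type is T (hypothetical helper, ignores buffer padding).
fn estimated_primitive_array_size<T>(num_rows: u64, has_nulls: bool) -> u64 {
    // Values buffer: one native value per row, nulls included.
    let values = size_of::<T>() as u64 * num_rows;
    // Validity bitmap: one bit per row, only needed when nulls are present.
    let validity = if has_nulls { (num_rows + 7) / 8 } else { 0 };
    values + validity
}
```

Under these assumptions, a 100 row Int64 array comes to 800 bytes of values, plus 13 bytes of validity bitmap if it contains nulls.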
Which issue does this PR close?
Closes #6847.
Rationale for this change
Part of apache/datafusion#7548.
What changes are included in this PR?
Are there any user-facing changes?
Adds a new public function.