-
Notifications
You must be signed in to change notification settings - Fork 1k
perf: Speed up Parquet file writing (10%, back to speed of 56) #8786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl
I verified this is about 10% faster in the tpchgen-cli benchmarl
A further optimization would be to change ColumnOrder::get_sort_order to take an Option<&LogicalType>, but that is a breaking API change.
I tried this out and didn't see any noticable performance change in my test, but I will make a PR anyways to discuss
|
I wonder if we should add a comment to |
I did add a note to the getters that the value is cloned. I just wonder do we want to deprecate the cloning getters? It would be nice if eventually |
+1 for |
What would the mechanics of such a change be? Do we deprecate with a note that the API will change in a future release? Or do we just change the signature in |
I think deprecate with a note is a good idea To be clear:
|
|
|
Given #8789 is open I don't think this needs any further work. Thanks again for the quick reviews. |
…e#8786) # Which issue does this PR close? - Partial fix for apache#8783. # Rationale for this change An appreciable slowdown in Parquet writing was noticed. At least part of the slowdown seems to stem from changes to `basic::LogicalType` which caused cloning the enum to take make longer than previously. # What changes are included in this PR? This adds a new `logical_type_ref` call to `BasicTypeInfo` and `ColumnDescriptor`. Unlike the existing `logical_type` which returns a cloned `Option<LogicalType>`, the new methods return `Option<&LogicalType>`. This new function is used in place of `logical_type` internally. # Are these changes tested? Should be covered by existing tests. # Are there any user-facing changes? No. A further optimization would be to change `ColumnOrder::get_sort_order` to take an `Option<&LogicalType>`, but that is a breaking API change.
…`, deprecate `get_logical_type` (#8789) # Which issue does this PR close? - Builds on #8786 from @etseidl # Rationale for this change @etseidl suggested: > A further optimization would be to change ColumnOrder::get_sort_order to take an Option<&LogicalType>, but that is a breaking API change. # What changes are included in this PR? 1. Deprecate `ColumnOrder::get_sort_order` 2. Add ColumnOrder::sort_order_for_type` that returns the same value # Are these changes tested? Yes by CI Note I was not able to observe any changes in benchmark performance with this, so I am not sure it it worth doing # Are there any user-facing changes? There is a new API but I think there not many people use this (low level) API so the disruption should be minimal --------- Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
An appreciable slowdown in Parquet writing was noticed. At least part of the slowdown seems to stem from changes to
basic::LogicalTypewhich caused cloning the enum to take make longer than previously.What changes are included in this PR?
This adds a new
logical_type_refcall toBasicTypeInfoandColumnDescriptor. Unlike the existinglogical_typewhich returns a clonedOption<LogicalType>, the new methods returnOption<&LogicalType>. This new function is used in place oflogical_typeinternally.Are these changes tested?
Should be covered by existing tests.
Are there any user-facing changes?
No.
A further optimization would be to change
ColumnOrder::get_sort_orderto take anOption<&LogicalType>, but that is a breaking API change.