-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable reading StringViewArray
by default from Parquet
#12092
Conversation
StringViewArray
by default from Parquet
This comment was marked as outdated.
This comment was marked as outdated.
bc5d7f7
to
27f7d1e
Compare
This comment was marked as outdated.
This comment was marked as outdated.
27f7d1e
to
ce5470d
Compare
datafusion/core/src/datasource/physical_plan/parquet/row_group_filter.rs
Outdated
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
2275216
to
e8f7384
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Benchmarks results clickbench_1:Is slower due to #12771
|
clickbench_extendedLooking good here
|
clickbench_partitionedSlowing down also due to #12509 / #12788
|
Current status: #11682 (comment) |
6500361
to
5b67afe
Compare
My plan is to pull the changes from the following PRs into this PR and rerun the overall perf test
Hopefully that will show the performance we need. Then we just need to get the various PRs actually merged and we can do this 😅 |
5b67afe
to
62a738e
Compare
62a738e
to
dda0a9a
Compare
dda0a9a
to
4a57773
Compare
Here are the current performance results: 🚀 Basically 10% faster across all queries and no slow downs 👏 @Rachelint @jayzhan211 @XiangpengHao and many many others.
Also for partitioned:
|
Next steps are to get the various PRs merged, get this one ready, and then write all about it. |
820b9ab
to
6112175
Compare
6112175
to
cbdc592
Compare
I made a new PR as this one has lots of now irrelevant historical context New PR here: #13101 |
Draft as it builds on:
53.1.0
/ fix clippy #12724binary_as_string
parquet option, upgrade to arrow/parquet53.2.0
#12816 @goldmedalGroupColumn
support forStringView
/ByteView
(faster grouping performance) #12809 from @Rachelinthits_partitioned
Which issue does this PR close?
Closes #11682
Rationale for this change
Reading data as
StringViewArray
is significantly faster thanStringArray
. We have been testing this behind a feature flag but it is now stable enough to enable by default.See blog post #11603:
Benchmark Results
What changes are included in this PR?
schema_force_view_types
to trueAre these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option
Context
@XiangpengHao debugged these tests previously using #11862