-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable reading StringViewArray
by default from Parquet (8% improvement for entire ClickBench suite)
#13101
Enable reading StringViewArray
by default from Parquet (8% improvement for entire ClickBench suite)
#13101
Conversation
cbdc592
to
c0ff96f
Compare
c0ff96f
to
c95b870
Compare
StringViewArray
by default from ParquetStringViewArray
by default from Parquet (8% improvement for entire ClickBench suite)
I think it's interesting to run some more Parquet benchmarks as well to detect any regression. It looks like query 18 of TPC-H is still a tiny bit slower maybe (ran it a few times in a row). The rest is as fast or faster:
|
Btw - I don't think this should hold off the merge / release, but would be good to track/note any regressions, however small. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably makes sense to run some more benchmarks just to be sure
/// If true, will use StringView/BinaryViewArray instead of String/BinaryArray | ||
/// when reading ParquetFiles | ||
#[structopt(long)] | ||
pub force_view_types: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep this (or a differently-named) flag as a kill-switch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 There is a kill switch (in the description of this PR)
set datafusion.execution.parquet.schema_force_view_types = false;
0 row(s) fetched.
Elapsed 0.000 seconds.
This particular code is for the benchmark drivers and I don't think it is super valuable to retain the benchmark in both configurations
I will do so |
My plan for this PR is to hedge against disruptions by making a stable DataFusion 42.2.0 and then merging this PR into the main for inclusion in #13065 I will review the benchmark results again and look at what is going on with TPCH Q18 |
I filed this one It may be an instance that Could help with |
This PR / project has been outstanding long enough and I desparately need to close off concurrent projects. Let's merge it in and keep iterating on main |
Thanks again @findepi @Dandandan (and @Rachelint and @goldmedal and @XiangpengHao and @jayzhan211 and so many others) |
Replacement for #12092 which had too much history on it
Which issue does this PR close?
Closes #11682
Rationale for this change
Reading data as
StringViewArray
is significantly faster thanStringArray
. We have been testing this behind a feature flag but it is now stable enough to enable by default.See blog post #11603:
Benchmark Results
(note I believe the changes for Q1 and Q2 are noise (there is no corresponding changes for the
clickbench_partition
ed table)Details for `clickbench`
What changes are included in this PR?
schema_force_view_types
to trueAre these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option