Enable reading StringViewArray by default from Parquet (8% improvement for entire ClickBench suite) #13101

Merged
merged 4 commits into from
Oct 30, 2024

Conversation

alamb
Contributor

@alamb alamb commented Oct 24, 2024

Replacement for #12092 which had too much history on it

Which issue does this PR close?

Closes #11682

Rationale for this change

Reading data as StringViewArray is significantly faster than StringArray. We have been testing this behind a feature flag but it is now stable enough to enable by default.
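
For readers unfamiliar with the layout, here is a minimal sketch of the idea behind a string view, based on the Arrow columnar format's variable-size binary view layout: each value is a fixed 16-byte view, strings of at most 12 bytes are stored inline, and longer strings keep a 4-byte prefix plus a buffer index and offset. The names `View`, `make_view`, and `views_equal` are hypothetical and this is not the arrow-rs implementation; it only illustrates why comparisons can often be decided without touching the string data.

```rust
/// Illustrative stand-in for a 16-byte Arrow string view.
/// Hypothetical type for explanation only, not the arrow-rs API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view (zero padded).
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a shared data buffer.
    Ref { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Build a view for `s`, appending long strings to `buffer`.
/// This sketch assumes a single data buffer, so `buffer_index` is always 0.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Ref { len, prefix, buffer_index: 0, offset }
    }
}

/// Compare two views. The data buffer is only read when both strings are
/// long and their lengths and prefixes already match.
fn views_equal(a: &View, b: &View, buffer: &[u8]) -> bool {
    match (a, b) {
        // Inline views carry all their bytes, so comparing the views suffices.
        (View::Inline { .. }, View::Inline { .. }) => a == b,
        (
            View::Ref { len: la, prefix: pa, offset: oa, .. },
            View::Ref { len: lb, prefix: pb, offset: ob, .. },
        ) => {
            if la != lb || pa != pb {
                return false; // decided from the views alone
            }
            let (oa, ob, n) = (*oa as usize, *ob as usize, *la as usize);
            buffer[oa..oa + n] == buffer[ob..ob + n]
        }
        // One inline and one long string always differ in length.
        _ => false,
    }
}
```

Calling `make_view("hello", &mut buf)` leaves `buf` untouched because the value fits inline, while two long strings with different first bytes compare as unequal in `views_equal` without reading `buf`.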

See blog post #11603:

Benchmark Results

Note: I believe the changes for Q1 and Q2 are noise (there are no corresponding changes for the clickbench_partitioned table).

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_enable_string_view_by_def… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.67ms │                           0.66ms │     no change │
│ QQuery 1     │    69.29ms │                          73.35ms │  1.06x slower │
│ QQuery 2     │   122.32ms │                         132.71ms │  1.08x slower │
│ QQuery 3     │   133.67ms │                         139.59ms │     no change │
│ QQuery 4     │   996.24ms │                         982.49ms │     no change │
│ QQuery 5     │  1116.44ms │                        1142.41ms │     no change │
│ QQuery 6     │    66.96ms │                          65.39ms │     no change │
│ QQuery 7     │    79.18ms │                          76.94ms │     no change │
│ QQuery 8     │  1354.44ms │                        1361.98ms │     no change │
│ QQuery 9     │  1357.76ms │                        1369.16ms │     no change │
│ QQuery 10    │   445.06ms │                         338.14ms │ +1.32x faster │
│ QQuery 11    │   492.56ms │                         375.15ms │ +1.31x faster │
│ QQuery 12    │  1251.03ms │                        1162.42ms │ +1.08x faster │
│ QQuery 13    │  1893.90ms │                        1696.30ms │ +1.12x faster │
│ QQuery 14    │  1378.55ms │                        1205.04ms │ +1.14x faster │
│ QQuery 15    │  1146.10ms │                        1128.91ms │     no change │
│ QQuery 16    │  2582.17ms │                        2570.14ms │     no change │
│ QQuery 17    │  2406.53ms │                        2373.54ms │     no change │
│ QQuery 18    │  4982.87ms │                        5235.21ms │  1.05x slower │
│ QQuery 19    │   124.21ms │                         126.42ms │     no change │
│ QQuery 20    │  1672.20ms │                        1417.33ms │ +1.18x faster │
│ QQuery 21    │  2100.85ms │                        1803.47ms │ +1.16x faster │
│ QQuery 22    │  5090.64ms │                        4808.47ms │ +1.06x faster │
│ QQuery 23    │ 11941.46ms │                       10767.45ms │ +1.11x faster │
│ QQuery 24    │   799.87ms │                         715.60ms │ +1.12x faster │
│ QQuery 25    │   691.09ms │                         629.54ms │ +1.10x faster │
│ QQuery 26    │   859.81ms │                         792.57ms │ +1.08x faster │
│ QQuery 27    │  2567.38ms │                        2151.88ms │ +1.19x faster │
│ QQuery 28    │ 14479.43ms │                       13706.54ms │ +1.06x faster │
│ QQuery 29    │   564.61ms │                         565.26ms │     no change │
│ QQuery 30    │  1228.80ms │                        1212.04ms │     no change │
│ QQuery 31    │  1283.12ms │                        1236.55ms │     no change │
│ QQuery 32    │  4218.28ms │                        4272.09ms │     no change │
│ QQuery 33    │  5355.74ms │                        4224.61ms │ +1.27x faster │
│ QQuery 34    │  5335.13ms │                        4244.13ms │ +1.26x faster │
│ QQuery 35    │  1824.93ms │                        1814.16ms │     no change │
│ QQuery 36    │   313.35ms │                         280.79ms │ +1.12x faster │
│ QQuery 37    │   216.40ms │                         192.53ms │ +1.12x faster │
│ QQuery 38    │   200.74ms │                         195.93ms │     no change │
│ QQuery 39    │   809.27ms │                         548.34ms │ +1.48x faster │
│ QQuery 40    │    85.54ms │                          83.31ms │     no change │
│ QQuery 41    │    79.56ms │                          78.84ms │     no change │
│ QQuery 42    │    92.83ms │                          89.91ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                             │ 83811.00ms │
│ Total Time (alamb_enable_string_view_by_default)   │ 77387.25ms │
│ Average Time (main_base)                           │  1949.09ms │
│ Average Time (alamb_enable_string_view_by_default) │  1799.70ms │
│ Queries Faster                                     │         19 │
│ Queries Slower                                     │          3 │
│ Queries with No Change                             │         21 │
└────────────────────────────────────────────────────┴────────────┘

--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_enable_string_view_by_def… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2802.77ms │                        2717.98ms │     no change │
│ QQuery 1     │   782.97ms │                         746.94ms │     no change │
│ QQuery 2     │  1623.97ms │                        1461.10ms │ +1.11x faster │
│ QQuery 3     │   765.69ms │                         794.06ms │     no change │
│ QQuery 4     │ 12436.70ms │                       12726.17ms │     no change │
│ QQuery 5     │ 19327.44ms │                       19004.62ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
Details for `clickbench`

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_enable_string_view_by_def… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.26ms │                           2.25ms │     no change │
│ QQuery 1     │    38.13ms │                          37.92ms │     no change │
│ QQuery 2     │    92.84ms │                          95.31ms │     no change │
│ QQuery 3     │   102.48ms │                          99.37ms │     no change │
│ QQuery 4     │   929.25ms │                         933.00ms │     no change │
│ QQuery 5     │   953.94ms │                         965.80ms │     no change │
│ QQuery 6     │    34.01ms │                          33.60ms │     no change │
│ QQuery 7     │    42.93ms │                          41.04ms │     no change │
│ QQuery 8     │  1382.85ms │                        1376.23ms │     no change │
│ QQuery 9     │  1317.61ms │                        1355.96ms │     no change │
│ QQuery 10    │   339.22ms │                         307.64ms │ +1.10x faster │
│ QQuery 11    │   392.56ms │                         356.97ms │ +1.10x faster │
│ QQuery 12    │  1073.64ms │                        1001.23ms │ +1.07x faster │
│ QQuery 13    │  1593.84ms │                        1435.10ms │ +1.11x faster │
│ QQuery 14    │  1227.03ms │                        1082.83ms │ +1.13x faster │
│ QQuery 15    │  1073.62ms │                        1102.01ms │     no change │
│ QQuery 16    │  2467.55ms │                        2473.70ms │     no change │
│ QQuery 17    │  2290.24ms │                        2272.28ms │     no change │
│ QQuery 18    │  4886.16ms │                        5191.68ms │  1.06x slower │
│ QQuery 19    │    96.06ms │                          95.67ms │     no change │
│ QQuery 20    │  1736.41ms │                        1243.50ms │ +1.40x faster │
│ QQuery 21    │  2017.17ms │                        1509.37ms │ +1.34x faster │
│ QQuery 22    │  5174.08ms │                        2679.55ms │ +1.93x faster │
│ QQuery 23    │ 10438.02ms │                        8952.56ms │ +1.17x faster │
│ QQuery 24    │   596.82ms │                         525.86ms │ +1.13x faster │
│ QQuery 25    │   483.61ms │                         431.71ms │ +1.12x faster │
│ QQuery 26    │   652.77ms │                         592.18ms │ +1.10x faster │
│ QQuery 27    │  2641.38ms │                        1872.83ms │ +1.41x faster │
│ QQuery 28    │ 13735.83ms │                       12904.89ms │ +1.06x faster │
│ QQuery 29    │   526.65ms │                         527.54ms │     no change │
│ QQuery 30    │  1025.28ms │                        1015.91ms │     no change │
│ QQuery 31    │  1098.82ms │                        1070.44ms │     no change │
│ QQuery 32    │  4207.19ms │                        4335.48ms │     no change │
│ QQuery 33    │  5248.23ms │                        4019.20ms │ +1.31x faster │
│ QQuery 34    │  5206.59ms │                        4012.75ms │ +1.30x faster │
│ QQuery 35    │  1906.62ms │                        1910.65ms │     no change │
│ QQuery 36    │   275.92ms │                         231.12ms │ +1.19x faster │
│ QQuery 37    │   125.11ms │                          95.03ms │ +1.32x faster │
│ QQuery 38    │   149.72ms │                         139.72ms │ +1.07x faster │
│ QQuery 39    │   768.75ms │                         480.50ms │ +1.60x faster │
│ QQuery 40    │    55.02ms │                          56.71ms │     no change │
│ QQuery 41    │    47.78ms │                          48.55ms │     no change │
│ QQuery 42    │    63.76ms │                          63.84ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                             │ 78517.73ms │
│ Total Time (alamb_enable_string_view_by_default)   │ 68979.45ms │
│ Average Time (main_base)                           │  1825.99ms │
│ Average Time (alamb_enable_string_view_by_default) │  1604.17ms │
│ Queries Faster                                     │         20 │
│ Queries Slower                                     │          1 │
│ Queries with No Change                             │         22 │
└────────────────────────────────────────────────────┴────────────┘
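
As a sanity check on the headline number, the summary totals above can be turned into a percentage with a small helper. This is a sketch using hypothetical function names (`percent_improvement`, `speedup`); the assumption that the "Change" column is the ratio of base time to new time is inferred from the rows in the tables, not taken from the benchmark script.

```rust
/// Percentage reduction in time going from `base_ms` to `new_ms`.
fn percent_improvement(base_ms: f64, new_ms: f64) -> f64 {
    (base_ms - new_ms) / base_ms * 100.0
}

/// Speedup factor (assumed to be what the "Change" column reports): base / new.
fn speedup(base_ms: f64, new_ms: f64) -> f64 {
    base_ms / new_ms
}
```

With the totals reported above, `percent_improvement(83811.00, 77387.25)` comes to roughly 7.7% for clickbench_1 and `percent_improvement(78517.73, 68979.45)` to roughly 12.1% for clickbench_partitioned. For a single query, `speedup(809.27, 548.34)` is about 1.48, matching the row for Query 39 in clickbench_1.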

What changes are included in this PR?

  1. Set schema_force_view_types to true

Are these changes tested?

Yes, by CI tests

Are there any user-facing changes?

  1. Faster reading of data from Parquet files

If you see an error related to StringView use, you can disable this feature using the schema_force_view_types option:

> set datafusion.execution.parquet.schema_force_view_types = false;
0 row(s) fetched.
Elapsed 0.000 seconds.

@alamb alamb marked this pull request as draft October 24, 2024 20:20
@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Oct 24, 2024
@alamb alamb force-pushed the alamb/enable_string_view_by_default2 branch from cbdc592 to c0ff96f Compare October 25, 2024 12:50
@alamb alamb force-pushed the alamb/enable_string_view_by_default2 branch from c0ff96f to c95b870 Compare October 25, 2024 13:03
@alamb alamb marked this pull request as ready for review October 25, 2024 13:03
@github-actions github-actions bot removed the proto Related to proto crate label Oct 25, 2024
@alamb alamb changed the title Enable reading StringViewArray by default from Parquet Enable reading StringViewArray by default from Parquet (8% improvement for entire ClickBench suite) Oct 25, 2024
@Dandandan
Contributor

I think it would be interesting to run some more Parquet benchmarks as well to detect any regressions.

It looks like query 18 of TPC-H is still a tiny bit slower maybe (ran it a few times in a row).

The rest is as fast or faster:

Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ enable_string_view_by_default2 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 119.79ms │                       124.32ms │     no change │
│ QQuery 2     │  60.37ms │                        57.00ms │ +1.06x faster │
│ QQuery 3     │  59.15ms │                        57.28ms │     no change │
│ QQuery 4     │  48.39ms │                        42.35ms │ +1.14x faster │
│ QQuery 5     │  81.52ms │                        83.75ms │     no change │
│ QQuery 6     │  22.36ms │                        22.13ms │     no change │
│ QQuery 7     │  98.51ms │                        95.09ms │     no change │
│ QQuery 8     │  81.93ms │                        79.63ms │     no change │
│ QQuery 9     │ 119.90ms │                       118.43ms │     no change │
│ QQuery 10    │ 117.68ms │                       108.71ms │ +1.08x faster │
│ QQuery 11    │  41.15ms │                        40.82ms │     no change │
│ QQuery 12    │  82.15ms │                        55.54ms │ +1.48x faster │
│ QQuery 13    │ 137.08ms │                       121.33ms │ +1.13x faster │
│ QQuery 14    │  41.28ms │                        38.38ms │ +1.08x faster │
│ QQuery 15    │  48.88ms │                        47.84ms │     no change │
│ QQuery 16    │  42.21ms │                        38.53ms │ +1.10x faster │
│ QQuery 17    │ 107.27ms │                       108.30ms │     no change │
│ QQuery 18    │ 156.68ms │                       170.70ms │  1.09x slower │
│ QQuery 19    │  80.49ms │                        67.76ms │ +1.19x faster │
│ QQuery 20    │  68.77ms │                        66.18ms │     no change │
│ QQuery 21    │ 125.50ms │                       128.51ms │     no change │
│ QQuery 22    │  34.73ms │                        35.04ms │     no change │
└──────────────┴──────────┴────────────────────────────────┴───────────────┘

@alamb alamb mentioned this pull request Oct 25, 2024
4 tasks
@Dandandan
Contributor

Btw - I don't think this should hold off the merge / release, but would be good to track/note any regressions, however small.

Contributor

@Dandandan Dandandan left a comment

Probably makes sense to run some more benchmarks just to be sure

/// If true, will use StringView/BinaryViewArray instead of String/BinaryArray
/// when reading ParquetFiles
#[structopt(long)]
pub force_view_types: bool,

Member

Should we keep this (or a differently-named) flag as a kill-switch?

Contributor Author

👍 There is a kill switch (in the description of this PR)

set datafusion.execution.parquet.schema_force_view_types = false;
0 row(s) fetched.
Elapsed 0.000 seconds.

This particular code is for the benchmark drivers and I don't think it is super valuable to retain the benchmark in both configurations

@alamb
Contributor Author

alamb commented Oct 25, 2024

Probably makes sense to run some more benchmarks just to be sure

I will do so

@alamb
Contributor Author

alamb commented Oct 29, 2024

My plan for this PR is to hedge against disruptions by making a stable DataFusion 42.2.0 release and then merging this PR into main for inclusion in #13065

I will review the benchmark results again and look at what is going on with TPCH Q18

@alamb
Contributor Author

alamb commented Oct 30, 2024

Btw - I don't think this should hold off the merge / release, but would be good to track/note any regressions, however small.

I filed this one

It may be an instance that

Could help with

@alamb
Contributor Author

alamb commented Oct 30, 2024

This PR / project has been outstanding long enough and I desperately need to close off concurrent projects. Let's merge it in and keep iterating on main

@alamb alamb merged commit 2d7892b into apache:main Oct 30, 2024
25 checks passed
@alamb
Contributor Author

alamb commented Oct 30, 2024

Thanks again @findepi @Dandandan (and @Rachelint and @goldmedal and @XiangpengHao and @jayzhan211 and so many others)

@alamb alamb deleted the alamb/enable_string_view_by_default2 branch October 30, 2024 18:20
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Enable reading StringView by default from Parquet (schema_force_string_view) by default
3 participants