perf: Collect Parquet dictionary binary as view #17475
Conversation
This optimizes how a Parquet dictionary over binary data is collected. Instead of pushing the items one at a time into a fresh buffer, the dictionary itself is used as the buffer and views are made into it. This should not only speed up the Parquet decoder, but also reduce memory consumption and speed up subsequent operations.

I did a small benchmark with the Wikipedia dataset (collect `a.parquet` once), but this does not really mean much.

```
Benchmark 1: After Optimization
  Time (mean ± σ):     2.007 s ±  0.005 s    [User: 1.712 s, System: 0.523 s]
  Range (min … max):   2.000 s …  2.013 s    10 runs

Benchmark 2: Before Optimization
  Time (mean ± σ):     2.285 s ±  0.009 s    [User: 1.956 s, System: 0.595 s]
  Range (min … max):   2.274 s …  2.306 s    10 runs

Summary
  After Optimization ran
    1.14 ± 0.01 times faster than Before Optimization
```
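The idea above can be sketched in a few lines of standalone Rust. This is an illustrative mock-up, not Polars' actual implementation: the `View` enum, `MAX_INLINE_SIZE`, and `views_from_dict` are hypothetical names chosen to mirror the diff, and a real binary-view layout packs these fields into a fixed 16-byte word rather than an enum. The point it shows is the same: short values are inlined into the view, and long values become `(buffer, offset, length)` references into the dictionary bytes, so no value data is copied for them.

```rust
// Sketch of view-based dictionary collection (assumed layout, not Polars' real one).
#[derive(Debug, Clone, Copy)]
enum View {
    // Short values are stored inline in the view itself.
    Inline { len: u8, data: [u8; 12] },
    // Longer values reference a range in a shared buffer.
    Buffered { buffer_idx: u32, offset: u32, len: u32 },
}

impl View {
    // Values at most this long fit inline; mirrors the check in the diff.
    const MAX_INLINE_SIZE: usize = 12;
}

/// Build views over the dictionary bytes. Long values are never copied:
/// they only record an (offset, len) into the shared dictionary buffer.
fn views_from_dict(dict: &[u8], offsets: &[(usize, usize)], buffer_idx: u32) -> Vec<View> {
    offsets
        .iter()
        .map(|&(start, len)| {
            if len <= View::MAX_INLINE_SIZE {
                let mut data = [0u8; 12];
                data[..len].copy_from_slice(&dict[start..start + len]);
                View::Inline { len: len as u8, data }
            } else {
                View::Buffered { buffer_idx, offset: start as u32, len: len as u32 }
            }
        })
        .collect()
}

fn main() {
    // The dictionary page's raw bytes double as the backing buffer.
    let dict: &[u8] = b"hithis string is longer than twelve bytes";
    let views = views_from_dict(dict, &[(0, 2), (2, 39)], 0);
    println!("{views:?}");
}
```

Collecting the decoded column then only requires materializing one view per row from the dictionary indices, instead of appending each value's bytes to a growing buffer.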
CodSpeed Performance Report: merging #17475 will improve performance by 26.82%.
Codecov Report

Attention: Patch coverage is

```diff
@@            Coverage Diff             @@
##             main   #17475      +/-   ##
==========================================
+ Coverage   80.46%   80.50%   +0.03%
==========================================
  Files        1483     1483
  Lines      194832   195122     +290
  Branches     2770     2781      +11
==========================================
+ Hits       156767   157078     +311
+ Misses      37556    37533      -23
- Partials      509      511       +2
```

☔ View full report in Codecov by Sentry.
14% on a whole read is quite a lot, considering you have decompression, decoding, etc.!
```rust
let buffer_idx = if max_length <= View::MAX_INLINE_SIZE as usize {
    0
} else {
    values.push_buffer(page_dict.values().clone())
```
Ah, nice. So we don't have to go through the builder, but immediately use the buffer. 👍