perf: Collect Parquet dictionary binary as view #17475
Conversation
This optimizes how a Parquet dictionary over binary data is collected. Instead of pushing the items one at a time into a fresh buffer, the dictionary itself is used as the buffer and views are made into it. This should not only speed up the Parquet decoder, but also reduce memory consumption and speed up subsequent operations.

I did a small benchmark with the Wikipedia dataset (collect `a.parquet` once), but this does not really mean much.

```
Benchmark 1: After Optimization
  Time (mean ± σ):     2.007 s ±  0.005 s    [User: 1.712 s, System: 0.523 s]
  Range (min … max):   2.000 s …  2.013 s    10 runs

Benchmark 2: Before Optimization
  Time (mean ± σ):     2.285 s ±  0.009 s    [User: 1.956 s, System: 0.595 s]
  Range (min … max):   2.274 s …  2.306 s    10 runs

Summary
  After Optimization ran
    1.14 ± 0.01 times faster than Before Optimization
```
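The idea above can be sketched in a few lines of standalone Rust. This is an illustrative mock-up, not Polars' actual implementation: the `View` enum, `MAX_INLINE_SIZE`, and `views_from_dict` are hypothetical names chosen to mirror the diff, and a real binary-view layout packs these fields into a fixed 16-byte word rather than an enum. The point it shows is the same: short values are inlined into the view, and long values become `(buffer, offset, length)` references into the dictionary bytes, so no value data is copied for them.

```rust
// Sketch of view-based dictionary collection (assumed layout, not Polars' real one).
#[derive(Debug, Clone, Copy)]
enum View {
    // Short values are stored inline in the view itself.
    Inline { len: u8, data: [u8; 12] },
    // Longer values reference a range in a shared buffer.
    Buffered { buffer_idx: u32, offset: u32, len: u32 },
}

impl View {
    // Values at most this long fit inline; mirrors the check in the diff.
    const MAX_INLINE_SIZE: usize = 12;
}

/// Build views over the dictionary bytes. Long values are never copied:
/// they only record an (offset, len) into the shared dictionary buffer.
fn views_from_dict(dict: &[u8], offsets: &[(usize, usize)], buffer_idx: u32) -> Vec<View> {
    offsets
        .iter()
        .map(|&(start, len)| {
            if len <= View::MAX_INLINE_SIZE {
                let mut data = [0u8; 12];
                data[..len].copy_from_slice(&dict[start..start + len]);
                View::Inline { len: len as u8, data }
            } else {
                View::Buffered { buffer_idx, offset: start as u32, len: len as u32 }
            }
        })
        .collect()
}

fn main() {
    // The dictionary page's raw bytes double as the backing buffer.
    let dict: &[u8] = b"hithis string is longer than twelve bytes";
    let views = views_from_dict(dict, &[(0, 2), (2, 39)], 0);
    println!("{views:?}");
}
```

Collecting the decoded column then only requires materializing one view per row from the dictionary indices, instead of appending each value's bytes to a growing buffer.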
CodSpeed Performance Report: merging #17475 will improve performance by 26.82%.
Codecov Report

Attention: Patch coverage is

```diff
@@            Coverage Diff             @@
##             main   #17475      +/-   ##
==========================================
+ Coverage   80.46%   80.50%   +0.03%
==========================================
  Files        1483     1483
  Lines      194832   195122     +290
  Branches     2770     2781      +11
==========================================
+ Hits       156767   157078     +311
+ Misses      37556    37533      -23
- Partials      509      511       +2
```

☔ View full report in Codecov by Sentry.
14% on a whole read is quite a lot, considering you have decompression, decoding, etc.!
```rust
let buffer_idx = if max_length <= View::MAX_INLINE_SIZE as usize {
    0
} else {
    values.push_buffer(page_dict.values().clone())
```
Ah, nice. So we don't have to go through the builder, but immediately use the buffer. 👍