ARROW-12032: [Rust] Optimize comparison kernels #9759

Dandandan · 2021-03-20T13:23:53Z

This adds a function from_trusted_len_iter_bool to speed up the creation of an array for booleans.

Benchmarks are a bit noisy, but seems to be ~10-20% faster for comparison kernels. This also has some positive effect on DataFusion queries, as they contain quite some (nested) comparisons in filters. For example, executing tpch query 6 in memory is ~7% faster.

Gnuplot not found, using plotters backend
eq Float32              time:   [54.204 us 54.284 us 54.364 us]                       
                        change: [-29.087% -28.838% -28.581%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild

eq scalar Float32       time:   [43.660 us 43.743 us 43.830 us]                               
                        change: [-30.819% -30.545% -30.269%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

neq Float32             time:   [68.726 us 68.893 us 69.048 us]                        
                        change: [-14.045% -13.772% -13.490%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

neq scalar Float32      time:   [46.251 us 46.322 us 46.395 us]                                
                        change: [-12.204% -11.952% -11.702%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

lt Float32              time:   [50.264 us 50.438 us 50.613 us]                        
                        change: [-21.300% -20.964% -20.649%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [48.847 us 48.929 us 49.013 us]                               
                        change: [-10.132% -9.9180% -9.6910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

lt_eq Float32           time:   [46.105 us 46.198 us 46.282 us]                           
                        change: [-21.276% -20.966% -20.703%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) low severe
  13 (13.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [47.359 us 47.456 us 47.593 us]                                  
                        change: [+0.2766% +0.5240% +0.7821%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

gt Float32              time:   [57.313 us 57.363 us 57.412 us]                       
                        change: [-18.328% -18.177% -18.031%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild

gt scalar Float32       time:   [44.091 us 44.132 us 44.175 us]                               
                        change: [-9.4233% -9.2747% -9.1273%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild

gt_eq Float32           time:   [55.856 us 55.932 us 56.007 us]                          
                        change: [-7.4997% -7.2656% -7.0334%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

gt_eq scalar Float32    time:   [42.365 us 42.419 us 42.482 us]                                  
                        change: [+0.5289% +0.7174% +0.9116%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

github-actions · 2021-03-20T13:24:12Z

https://issues.apache.org/jira/browse/ARROW-12032

codecov-io · 2021-03-20T14:35:33Z

Codecov Report

Merging #9759 (aa4a4fb) into master (29feea0) will decrease coverage by 0.03%.
The diff coverage is 77.66%.

@@            Coverage Diff             @@
##           master    #9759      +/-   ##
==========================================
- Coverage   82.59%   82.55%   -0.04%     
==========================================
  Files         248      249       +1     
  Lines       58294    58837     +543     
==========================================
+ Hits        48149    48575     +426     
- Misses      10145    10262     +117

Impacted Files	Coverage Δ
rust/arrow/src/util/bench_util.rs	`0.00% <ø> (ø)`
rust/datafusion/examples/flight_server.rs	`0.00% <0.00%> (ø)`
rust/datafusion/src/execution/dataframe_impl.rs	`87.82% <0.00%> (-2.32%)`	⬇️
rust/datafusion/src/optimizer/constant_folding.rs	`92.30% <0.00%> (-0.36%)`	⬇️
...datafusion/src/optimizer/hash_build_probe_order.rs	`53.60% <0.00%> (-1.72%)`	⬇️
...t/datafusion/src/optimizer/projection_push_down.rs	`98.66% <ø> (ø)`
rust/datafusion/src/physical_plan/mod.rs	`88.00% <ø> (ø)`
rust/parquet/src/basic.rs	`88.59% <ø> (+1.36%)`	⬆️
rust/parquet/src/schema/parser.rs	`90.20% <0.00%> (ø)`
rust/parquet/src/arrow/schema.rs	`86.57% <54.47%> (-5.79%)`	⬇️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81d6724...aa4a4fb. Read the comment docs.

Dandandan · 2021-03-20T17:03:44Z

@jorgecarleitao FYI

I also tried related code from your arrow2 repo, but got worse performance results from the approach taken there.

yordan-pavlov · 2021-03-20T20:08:00Z

rust/arrow/src/buffer/mutable.rs

+            //collect (up to) 8 bits into a byte
+            while mask != 0 {
+                if let Some(value) = iterator.next() {
+                    byte_accum |= match value {


I wonder if the bool iterator could be split into chunks (for example, using https://docs.rs/itertools/0.4.2/itertools/struct.Chunks.html or alternatively using https://doc.rust-lang.org/std/primitive.slice.html#method.chunks) of 8 bool values, then each chunk is mapped into a byte by converting each bool value into a byte (for example using std::mem::transmute::<bool, u8>), then shifting according to the position in the chunk, and applying in the output byte, and finally the resulting byte iterator would be used to build the buffer directly. This is the fastest implementation I can imagine because it eliminates as many conditions / checks as possible (and conditions are the enemy of fast).

I also think there are some faster ways to speed up the inner loop, yours sounds like a great idea to try out. I was also looking at the arrow2 repository of @jorgecarleitao , but I think I have been looking to an older commit before which turned out to be slower (I expected it to be faster, but sometimes the compiler can be quite surprising in what compiles to efficient code.

I think the latest version is over here:

https://github.com/jorgecarleitao/arrow2/blob/be905f1b1f0293ef427387bc35b2e9956ec3336f/src/bitmap/mutable.rs#L209

I agree with you all ❤️

I admit I have spent an immoral amount of time trying to optimize bitmaps, but I have unfortunately not yet concluded what is the best way to handle them. I think that we may not being able to express to the compiler what we want it to do (some kind of operation over a single byte). @yordan-pavlov suggestion is a great one in that direction, though.

FWIW, on my computer (a VM on azure), arrow master (not this PR) is giving

eq Float32 time: [113.40 us 114.81 us 116.28 us] eq scalar Float32 time: [96.824 us 98.638 us 101.34 us]

and arrow2 is giving

eq Float32 time: [84.519 us 86.065 us 87.772 us] eq scalar Float32 time: [57.682 us 58.315 us 59.014 us]

This PR's idea on arrow2 (with corresponding changes) is giving me -14% on eq Float32 and +35% on eq scalar Float32. I pushed these benches to master there.

Note the difference between scalar and non-scalar: it is the exact same code on the trusted_len function, but a 30% difference in performance between them; imo this indicates that we are fighting with the compiler to try to explain what we are trying to express here.

FYI, I wrote a minimal repo to evaluate these things and added a reddit post to try to get some help / feedback on this.

@yordan-pavlov's idea yields -50% on array-to-scalar and -10% on array-to-array 🚀

@yordan-pavlov
I am not sure where the win would be in that case.
I would expect the first idea to be compiled to roughly the same code (if all compiler optimizations work out)?

For the Vec one - I would expect that would be slower as it introduces an extra loop / allocation and barrier for optimization?

@Dandandan yes, I agree that it's counter-intuitive, but I find that the compiler often surprises me so it's best to try every option; I will try to extend the benchmark repo when I have a bit more free time later today

You are right, sometimes the result can be surprising.
I tried this variation, this compiles to the same unrolled 38 instructions:

chunk.iter().map(|&c_i| c_i == rhs).enumerate().for_each(|(i, c_i)| { *byte |= if c_i { 1 << i } else { 0 };

yes, I tried to benchmark doing the comparison separately, but it's not faster; on my machine the fastest version is:

(0..8).for_each(|i| { if chunk[i] == rhs { *byte = set(*byte, i) } });

and that's even faster than:

chunk.iter().enumerate().for_each(|(i, &c_i)| { *byte |= unsafe { mem::transmute::<bool, u8>(c_i == rhs) << i }; });

this is with the test data configured as:

let vec = (0..20049).map(|x| (x * x + x) % 2).collect::<Vec<_>>();

I think 2000 items (the old length of the test data) is much too small for realistic benchmarking, and it would make more sense to benchmark with test data with length same as the default batch size in DataFusion (I think this was recently increased).

@yordan-pavlov

Also see findings in the PR here (with updated benchmark). (x * x + x) % 2 will give 0 on even and 1 on uneven inputs I, so the pattern/branches will be very predictable, especially in the scalar version.

jorgecarleitao/arrow2#17

jorgecarleitao

LGTM. Thanks a lot, @Dandandan !

@yordan-pavlov

Implementation based on [this comment](apache/arrow#9759 (comment)) from @yordan-pavlov. Performance: eq Float32 time: [70.990 us 71.775 us 72.636 us] change: [-12.665% -10.799% -8.9867%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe eq scalar Float32 time: [26.832 us 27.174 us 27.540 us] change: [-52.665% -51.655% -50.525%] (p = 0.00 < 0.05) Performance has improved.

@yordan-pavlov

Implementation based on [this comment](apache/arrow#9759 (comment)) from @yordan-pavlov. Performance: eq Float32 time: [70.990 us 71.775 us 72.636 us] change: [-12.665% -10.799% -8.9867%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe eq scalar Float32 time: [26.832 us 27.174 us 27.540 us] change: [-52.665% -51.655% -50.525%] (p = 0.00 < 0.05) Performance has improved.

yordan-pavlov · 2021-03-21T12:10:20Z

@jorgecarleitao thank you for trying out my suggestion; I am very happy to see this resulted in some tangible performance improvements. @Dandandan I love the use of godbolt.org, I should be using it more to gain further insight into what code compiles into efficient instructions.

Dandandan · 2021-03-21T22:04:57Z

I added the faster implementations using chunks for primitives to this PR and fixed the benchmarks.

seddonm1 · 2021-03-25T02:18:14Z

@Dandandan @alamb

Guys this is failing to compile on my machine. It looks like the CICD failed but was merged?

(Sorry ignore) saw: #9796

alamb · 2021-03-25T10:55:03Z

Guys this is failing to compile on my machine. It looks like the CICD failed but was merged?

Yeah sorry @seddonm1 -- CI passed on this PR @ aa4a4fb but it had a logical conflict with another PR that was previously merged

Use trusted_len for comparison kernels

f233f47

Dandandan changed the title ~~ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for boolsU~~ ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for bools Mar 20, 2021

github-actions bot added the Component: Rust label Mar 20, 2021

Fmt

1aa7baf

Dandandan added 3 commits March 20, 2021 15:39

Add documentation and doctest

5f2dfa2

Add documentation

c4cb72a

Clippy

0e1c8ed

yordan-pavlov reviewed Mar 20, 2021

View reviewed changes

jorgecarleitao approved these changes Mar 20, 2021

View reviewed changes

jorgecarleitao mentioned this pull request Mar 21, 2021

Improved performance of primitive comparisons. jorgecarleitao/arrow2#16

Merged

Dandandan added 2 commits March 21, 2021 17:47

Fix benches

d51e75e

Add faster primitive versions for comparison kernels

aa4a4fb

Dandandan changed the title ~~ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for bools~~ ARROW-12032: [Rust] Optimize comparison kernels Mar 21, 2021

alamb closed this in 8dd6abb Mar 24, 2021

asfimport mentioned this pull request Mar 25, 2021

[Rust] Optimize comparison kernels using trusted_len iterator for bools #27861

Closed

Uh oh!

ARROW-12032: [Rust] Optimize comparison kernels #9759

ARROW-12032: [Rust] Optimize comparison kernels #9759

Uh oh!

Conversation

Dandandan commented Mar 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 20, 2021

Uh oh!

codecov-io commented Mar 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Dandandan commented Mar 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yordan-pavlov Mar 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Mar 20, 2021

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Mar 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Mar 21, 2021

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Mar 21, 2021

Choose a reason for hiding this comment

Uh oh!

Dandandan Mar 21, 2021

Choose a reason for hiding this comment

Uh oh!

yordan-pavlov Mar 21, 2021

Choose a reason for hiding this comment

Uh oh!

Dandandan Mar 21, 2021

Choose a reason for hiding this comment

Uh oh!

yordan-pavlov Mar 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Mar 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

yordan-pavlov commented Mar 21, 2021

Uh oh!

Dandandan commented Mar 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seddonm1 commented Mar 25, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Mar 25, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Dandandan commented Mar 20, 2021 •

edited

Loading

codecov-io commented Mar 20, 2021 •

edited

Loading

Dandandan commented Mar 20, 2021 •

edited

Loading

yordan-pavlov Mar 20, 2021 •

edited

Loading

jorgecarleitao Mar 20, 2021 •

edited

Loading

yordan-pavlov Mar 21, 2021 •

edited

Loading

Dandandan Mar 21, 2021 •

edited

Loading

Dandandan commented Mar 21, 2021 •

edited

Loading

seddonm1 commented Mar 25, 2021 •

edited

Loading