Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Mar 20, 2021

This adds a function from_trusted_len_iter_bool to speed up the creation of an array for booleans.

Benchmarks are a bit noisy, but seems to be ~10-20% faster for comparison kernels. This also has some positive effect on DataFusion queries, as they contain quite some (nested) comparisons in filters. For example, executing tpch query 6 in memory is ~7% faster.

Gnuplot not found, using plotters backend
eq Float32              time:   [54.204 us 54.284 us 54.364 us]                       
                        change: [-29.087% -28.838% -28.581%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild

eq scalar Float32       time:   [43.660 us 43.743 us 43.830 us]                               
                        change: [-30.819% -30.545% -30.269%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

neq Float32             time:   [68.726 us 68.893 us 69.048 us]                        
                        change: [-14.045% -13.772% -13.490%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

neq scalar Float32      time:   [46.251 us 46.322 us 46.395 us]                                
                        change: [-12.204% -11.952% -11.702%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

lt Float32              time:   [50.264 us 50.438 us 50.613 us]                        
                        change: [-21.300% -20.964% -20.649%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [48.847 us 48.929 us 49.013 us]                               
                        change: [-10.132% -9.9180% -9.6910%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

lt_eq Float32           time:   [46.105 us 46.198 us 46.282 us]                           
                        change: [-21.276% -20.966% -20.703%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) low severe
  13 (13.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [47.359 us 47.456 us 47.593 us]                                  
                        change: [+0.2766% +0.5240% +0.7821%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

gt Float32              time:   [57.313 us 57.363 us 57.412 us]                       
                        change: [-18.328% -18.177% -18.031%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild

gt scalar Float32       time:   [44.091 us 44.132 us 44.175 us]                               
                        change: [-9.4233% -9.2747% -9.1273%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild

gt_eq Float32           time:   [55.856 us 55.932 us 56.007 us]                          
                        change: [-7.4997% -7.2656% -7.0334%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

gt_eq scalar Float32    time:   [42.365 us 42.419 us 42.482 us]                                  
                        change: [+0.5289% +0.7174% +0.9116%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

@Dandandan Dandandan changed the title ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for boolsU ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for bools Mar 20, 2021
@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Mar 20, 2021

Codecov Report

Merging #9759 (aa4a4fb) into master (29feea0) will decrease coverage by 0.03%.
The diff coverage is 77.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9759      +/-   ##
==========================================
- Coverage   82.59%   82.55%   -0.04%     
==========================================
  Files         248      249       +1     
  Lines       58294    58837     +543     
==========================================
+ Hits        48149    48575     +426     
- Misses      10145    10262     +117     
Impacted Files Coverage Δ
rust/arrow/src/util/bench_util.rs 0.00% <ø> (ø)
rust/datafusion/examples/flight_server.rs 0.00% <0.00%> (ø)
rust/datafusion/src/execution/dataframe_impl.rs 87.82% <0.00%> (-2.32%) ⬇️
rust/datafusion/src/optimizer/constant_folding.rs 92.30% <0.00%> (-0.36%) ⬇️
...datafusion/src/optimizer/hash_build_probe_order.rs 53.60% <0.00%> (-1.72%) ⬇️
...t/datafusion/src/optimizer/projection_push_down.rs 98.66% <ø> (ø)
rust/datafusion/src/physical_plan/mod.rs 88.00% <ø> (ø)
rust/parquet/src/basic.rs 88.59% <ø> (+1.36%) ⬆️
rust/parquet/src/schema/parser.rs 90.20% <0.00%> (ø)
rust/parquet/src/arrow/schema.rs 86.57% <54.47%> (-5.79%) ⬇️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81d6724...aa4a4fb. Read the comment docs.

@Dandandan
Copy link
Contributor Author

Dandandan commented Mar 20, 2021

@jorgecarleitao FYI

I also tried related code from your arrow2 repo, but got worse performance results from the approach taken there.

//collect (up to) 8 bits into a byte
while mask != 0 {
if let Some(value) = iterator.next() {
byte_accum |= match value {
Copy link
Contributor

@yordan-pavlov yordan-pavlov Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the bool iterator could be split into chunks (for example, using https://docs.rs/itertools/0.4.2/itertools/struct.Chunks.html or alternatively using https://doc.rust-lang.org/std/primitive.slice.html#method.chunks) of 8 bool values, then each chunk is mapped into a byte by converting each bool value into a byte (for example using std::mem::transmute::<bool, u8>), then shifting according to the position in the chunk, and applying in the output byte, and finally the resulting byte iterator would be used to build the buffer directly. This is the fastest implementation I can imagine because it eliminates as many conditions / checks as possible (and conditions are the enemy of fast).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think there are some faster ways to speed up the inner loop, yours sounds like a great idea to try out. I was also looking at the arrow2 repository of @jorgecarleitao , but I think I have been looking to an older commit before which turned out to be slower (I expected it to be faster, but sometimes the compiler can be quite surprising in what compiles to efficient code.

I think the latest version is over here:

https://github.com/jorgecarleitao/arrow2/blob/be905f1b1f0293ef427387bc35b2e9956ec3336f/src/bitmap/mutable.rs#L209

Copy link
Member

@jorgecarleitao jorgecarleitao Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you all ❤️

I admit I have spent an immoral amount of time trying to optimize bitmaps, but I have unfortunately not yet concluded what is the best way to handle them. I think that we may not being able to express to the compiler what we want it to do (some kind of operation over a single byte). @yordan-pavlov suggestion is a great one in that direction, though.

FWIW, on my computer (a VM on azure), arrow master (not this PR) is giving

eq Float32              time:   [113.40 us 114.81 us 116.28 us]
eq scalar Float32       time:   [96.824 us 98.638 us 101.34 us]

and arrow2 is giving

eq Float32              time:   [84.519 us 86.065 us 87.772 us]
eq scalar Float32       time:   [57.682 us 58.315 us 59.014 us]

This PR's idea on arrow2 (with corresponding changes) is giving me -14% on eq Float32 and +35% on eq scalar Float32. I pushed these benches to master there.

Note the difference between scalar and non-scalar: it is the exact same code on the trusted_len function, but a 30% difference in performance between them; imo this indicates that we are fighting with the compiler to try to explain what we are trying to express here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I wrote a minimal repo to evaluate these things and added a reddit post to try to get some help / feedback on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yordan-pavlov's idea yields -50% on array-to-scalar and -10% on array-to-array 🚀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yordan-pavlov
I am not sure where the win would be in that case.
I would expect the first idea to be compiled to roughly the same code (if all compiler optimizations work out)?

For the Vec one - I would expect that would be slower as it introduces an extra loop / allocation and barrier for optimization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan yes, I agree that it's counter-intuitive, but I find that the compiler often surprises me so it's best to try every option; I will try to extend the benchmark repo when I have a bit more free time later today

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, sometimes the result can be surprising.
I tried this variation, this compiles to the same unrolled 38 instructions:

            chunk.iter().map(|&c_i| c_i == rhs).enumerate().for_each(|(i, c_i)| {
                *byte |= if c_i {
                    1 << i
                } else {
                    0
                };

Copy link
Contributor

@yordan-pavlov yordan-pavlov Mar 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I tried to benchmark doing the comparison separately, but it's not faster; on my machine the fastest version is:

(0..8).for_each(|i| {
                if chunk[i] == rhs {
                    *byte = set(*byte, i)
                }
            });

and that's even faster than:

chunk.iter().enumerate().for_each(|(i, &c_i)| {
                *byte |= unsafe { mem::transmute::<bool, u8>(c_i == rhs) << i };
            });

this is with the test data configured as:

let vec = (0..20049).map(|x| (x * x + x) % 2).collect::<Vec<_>>();

I think 2000 items (the old length of the test data) is much too small for realistic benchmarking, and it would make more sense to benchmark with test data with length same as the default batch size in DataFusion (I think this was recently increased).

Copy link
Contributor Author

@Dandandan Dandandan Mar 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yordan-pavlov

Also see findings in the PR here (with updated benchmark). (x * x + x) % 2 will give 0 on even and 1 on uneven inputs I, so the pattern/branches will be very predictable, especially in the scalar version.

jorgecarleitao/arrow2#17

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, @Dandandan !

jorgecarleitao added a commit to jorgecarleitao/arrow2 that referenced this pull request Mar 21, 2021
Implementation based on [this comment](apache/arrow#9759 (comment))
from @yordan-pavlov.

Performance:

eq Float32              time:   [70.990 us 71.775 us 72.636 us]
                        change: [-12.665% -10.799% -8.9867%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [26.832 us 27.174 us 27.540 us]
                        change: [-52.665% -51.655% -50.525%] (p = 0.00 < 0.05)
                        Performance has improved.
jorgecarleitao added a commit to jorgecarleitao/arrow2 that referenced this pull request Mar 21, 2021
Implementation based on [this comment](apache/arrow#9759 (comment))
from @yordan-pavlov.

Performance:

eq Float32              time:   [70.990 us 71.775 us 72.636 us]
                        change: [-12.665% -10.799% -8.9867%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [26.832 us 27.174 us 27.540 us]
                        change: [-52.665% -51.655% -50.525%] (p = 0.00 < 0.05)
                        Performance has improved.
@yordan-pavlov
Copy link
Contributor

@jorgecarleitao thank you for trying out my suggestion; I am very happy to see this resulted in some tangible performance improvements. @Dandandan I love the use of godbolt.org, I should be using it more to gain further insight into what code compiles into efficient instructions.

@Dandandan Dandandan changed the title ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for bools ARROW-12032: [Rust] Optimize comparison kernels Mar 21, 2021
@Dandandan
Copy link
Contributor Author

Dandandan commented Mar 21, 2021

I added the faster implementations using chunks for primitives to this PR and fixed the benchmarks.

@alamb alamb closed this in 8dd6abb Mar 24, 2021
@seddonm1
Copy link
Contributor

seddonm1 commented Mar 25, 2021

@Dandandan @alamb

Guys this is failing to compile on my machine. It looks like the CICD failed but was merged?

(Sorry ignore) saw: #9796

@alamb
Copy link
Contributor

alamb commented Mar 25, 2021

Guys this is failing to compile on my machine. It looks like the CICD failed but was merged?

Yeah sorry @seddonm1 -- CI passed on this PR @ aa4a4fb but it had a logical conflict with another PR that was previously merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants