Add specialized filter kernels in `compute` module (up to 10x faster) #1248

tustvold · 2022-01-30T00:03:57Z

Which issue does this PR close?

Rationale for this change

Make filter kernels faster. This relates to the observation underlying #1225 and #1229, that the filter kernels are surprisingly slow for what they are doing. In many cases the filter kernels on master are slower than converting the filter to a list of indices and using take...

Currently this PR is ~10x faster than master, and I suspect this could be pushed further.

filter u8               time:   [47.525 us 47.538 us 47.554 us]                       
                        change: [-90.065% -90.056% -90.048%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.3351 us 2.3358 us 2.3365 us]
                        change: [-81.582% -81.570% -81.558%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.3204 us 1.3218 us 1.3233 us]
                        change: [-70.639% -70.572% -70.512%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

This builds on #1228 and adds specialized filter implementations for primitive array types. I suspect specialized implementations for byte array types, and dictionaries would likely yield similarly significant benefits and I may include them in this PR or a follow up.

Aside from the performance benefits, having specialized impls, much like we do for the take kernels will also allow us to take advantage of SIMD intrinsics either through manual implementation, or just helping the compiler's auto-vectorization.

Are there any user-facing changes?

No

tustvold · 2022-01-30T00:05:53Z

arrow/src/compute/kernels/filter.rs

+
+    let mut buffer = MutableBuffer::with_capacity(filter_count * T::get_byte_width());
+
+    let selectivity_frac = filter_count as f64 / filter.len() as f64;


Credit to @sunchao for the inspiration behind this hybrid approach - #1191 (comment)

tustvold · 2022-01-30T00:09:22Z

arrow/src/compute/kernels/filter.rs

-            let mut mutable =
-                MutableArrayData::new(vec![array.data_ref()], false, filter_count);
+        // actually filter
+        _ => match values.data_type() {


This block is copied straight from the take kernel - there is probably a way to avoid this duplication with some more layers of macros

codecov-commenter · 2022-01-30T00:24:00Z

Codecov Report

Merging #1248 (9fc5fac) into master (1764399) will decrease coverage by 0.00%.
The diff coverage is 75.60%.

@@            Coverage Diff             @@
##           master    #1248      +/-   ##
==========================================
- Coverage   82.99%   82.99%   -0.01%     
==========================================
  Files         180      180              
  Lines       52288    52587     +299     
==========================================
+ Hits        43396    43644     +248     
- Misses       8892     8943      +51

Impacted Files	Coverage Δ
arrow/src/util/bench_util.rs	`95.23% <ø> (ø)`
arrow/src/compute/kernels/filter.rs	`84.77% <75.60%> (-7.73%)`	⬇️
arrow/src/datatypes/datatype.rs	`66.38% <0.00%> (-0.43%)`	⬇️
parquet_derive/src/parquet_field.rs	`65.98% <0.00%> (-0.23%)`	⬇️
arrow/src/ffi.rs	`84.69% <0.00%> (ø)`
arrow/src/csv/reader.rs	`88.12% <0.00%> (ø)`
parquet/src/data_type.rs	`76.61% <0.00%> (ø)`
arrow/src/array/array_list.rs	`95.52% <0.00%> (ø)`
arrow/src/array/array_union.rs	`90.76% <0.00%> (ø)`
arrow/src/array/raw_pointer.rs	`100.00% <0.00%> (ø)`
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1764399...9fc5fac. Read the comment docs.

Dandandan · 2022-01-30T15:51:02Z

This is really cool and promising!. In my experience the filter kernels can be quite expensive in benchmarks. Great stuff 😃

tustvold · 2022-01-30T22:54:54Z

I found some time this afternoon, so bashed out porting the filter context abstraction (caching the selection vector) and fixing up the null buffer construction.

Here's where we stand now, across the board about 10x performance uplift other than for high selectivity with nulls where the bottleneck on copying ranges of null bitmasks is unchanged

filter u8               time:   [48.733 us 48.795 us 48.889 us]                       
                        change: [-90.138% -90.127% -90.116%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.4004 us 2.4048 us 2.4099 us]
                        change: [-81.016% -80.967% -80.915%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.4000 us 1.4015 us 1.4032 us]
                        change: [-67.378% -67.309% -67.246%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8       time:   [14.783 us 14.798 us 14.816 us]                               
                        change: [-95.157% -95.150% -95.143%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 high selectivity                                                                             
                        time:   [1.1631 us 1.1636 us 1.1642 us]
                        change: [-85.209% -85.201% -85.192%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 low selectivity                                                                            
                        time:   [150.47 ns 150.64 ns 150.83 ns]
                        change: [-84.434% -84.380% -84.295%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs                                                                             
                        time:   [40.762 us 40.771 us 40.781 us]
                        change: [-89.275% -89.267% -89.259%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs high selectivity                                                                             
                        time:   [7.0784 us 7.0828 us 7.0876 us]
                        change: [+2.4840% +2.5549% +2.6258%] (p = 0.00 < 0.05)
                        Performance has regressed.

filter context u8 w NULLs low selectivity                                                                            
                        time:   [267.79 ns 267.96 ns 268.12 ns]
                        change: [-72.091% -72.023% -71.946%] (p = 0.00 < 0.05)
                        Performance has improved.

filter f32              time:   [117.24 us 117.27 us 117.32 us]                       
                        change: [-79.651% -79.637% -79.623%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32      time:   [41.474 us 41.486 us 41.499 us]                                
                        change: [-88.917% -88.911% -88.905%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32 high selectivity                                                                             
                        time:   [11.289 us 11.292 us 11.294 us]
                        change: [+3.9848% +4.0461% +4.1142%] (p = 0.00 < 0.05)
                        Performance has regressed.

filter context f32 low selectivity                                                                            
                        time:   [295.75 ns 296.19 ns 296.65 ns]
                        change: [-69.321% -69.181% -69.008%] (p = 0.00 < 0.05)
                        Performance has improved.

filter single record batch                                                                            
                        time:   [69.716 us 69.749 us 69.783 us]
                        change: [-86.024% -86.009% -85.998%] (p = 0.00 < 0.05)
                        Performance has improved.

What is interesting, at least to me, is the performance tax imposed by the packed bitmask representation of BooleanArray - even with non-trivial bit-twiddling shenanigans it still appears to be more performant to hydrate the filter to an array of indices or slices when filtering multiple arrays.

filter optimize         time:   [45.865 us 45.990 us 46.096 us]                             

filter optimize high selectivity                                                                             
                        time:   [1.6900 us 1.6910 us 1.6920 us]

filter optimize low selectivity                                                                             
                        time:   [1.4087 us 1.4116 us 1.4149 us]

Perhaps the compiler just struggles to auto-vectorise bitmask loops or something, not sure 😅

yordan-pavlov · 2022-01-31T19:08:29Z

arrow/src/compute/kernels/filter.rs

+}
+
+impl FilterBuilder {
+    /// Create a new [`FilterBuilder`] that can be used construct [`FilterPredicate`]


this probably should read: ... that can be used TO construct A FilterPredicate

yordan-pavlov · 2022-01-31T19:28:54Z

this looks very promising @tustvold - excellent performance improvement; when I made the filter kernel up to 550 times faster in 457ea62 I was wondering if this was the limit of performance or it could be made even faster, and I am very happy to see that you have been able to reach a new level of performance for the filter kernel

tustvold · 2022-01-31T20:13:06Z

550 times faster

Damn, that's some improvement 😆

jhorstmann · 2022-02-01T14:00:30Z

To get a good feeling for how close we are to the theoretical peak performance it could be useful to add throughput numbers to the criterion benchmark. When criterion know the number of bytes that are processed (size of the primitive array + size of filter bitmap), it can report the results in Gb/s, which can be compared to the max single-threaded memory bandwidth of the machine, which is usually between 10-15Gb/s. When the arrays fit into L1 cache it can be even higher.

The filter benchmarks are also a bit focused on bytes, I don't know how common those are in DataFusion, but I would think that 32 or 64 bit numbers are much more common and should be the focus of benchmarking.

jhorstmann · 2022-02-01T14:05:15Z

What is interesting, at least to me, is the performance tax imposed by the packed bitmask representation of BooleanArray - even with non-trivial bit-twiddling shenanigans it still appears to be more performant to hydrate the filter to an array of indices or slices when filtering multiple arrays.

On recent intel machines there would be a much faster way using the pext instruction. That basically implements a filter for 64 bits at a time.

tustvold · 2022-02-01T14:46:45Z

The filter benchmarks are also a bit focused on bytes, I don't know how common those are in DataFusion, but I would think that 32 or 64 bit numbers are much more common and should be the focus of benchmarking.

Yeah, the f32 benchmarks are encouraging that the performance uplift isn't just a quirk of the fact the benchmarks currently use u8 for integers.

On recent intel machines there would be a much faster way using the pext instruction. That basically implements a filter for 64 bits at a time.

Yeah, I do not doubt some artisanal use of intrinsics could likely eliminate much of the cost of bitmasks, perhaps even surpassing indices. I had I guess hoped that it would be possible to get further before resorting to the arcane 😆

tustvold · 2022-02-02T23:33:16Z

So I've added string and dictionary support. Strings only see a ~2x performance bump, this can almost certainly be pushed further, but I figured it was good enough and I can come back to it in subsequent PRs. Dictionaries see the full 9-10x bump that the integer types see which is 👌

I'm marking this ready for review because I think it represents a non-trivial improvement that would be super-awesome to make it into the arrow 9 release. There is definitely further refinement that could take place, and more perf that could be squeezed out, but I'd prefer to do that in subsequent PRs unless people object.

For reference here are the latest numbers, down a little bit since #1228 was merged, but still pretty exciting imo

filter u8               time:   [47.731 us 47.740 us 47.748 us]                       
                        change: [-85.383% -85.377% -85.371%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.4163 us 2.4185 us 2.4210 us]
                        change: [-73.969% -73.939% -73.909%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.3798 us 1.3814 us 1.3829 us]
                        change: [-41.966% -41.649% -41.441%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8       time:   [14.557 us 14.567 us 14.580 us]                               
                        change: [-94.178% -94.175% -94.171%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 high selectivity                                                                             
                        time:   [1.1673 us 1.1678 us 1.1684 us]
                        change: [-84.915% -84.905% -84.894%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 low selectivity                                                                            
                        time:   [150.77 ns 150.80 ns 150.83 ns]
                        change: [-82.779% -82.758% -82.735%] (p = 0.00 < 0.05)
                        Performance has improved.

filter i32              time:   [68.835 us 68.857 us 68.882 us]                       
                        change: [-78.627% -78.613% -78.598%] (p = 0.00 < 0.05)
                        Performance has improved.

filter i32 high selectivity                                                                             
                        time:   [6.1117 us 6.1135 us 6.1154 us]
                        change: [-54.814% -54.788% -54.759%] (p = 0.00 < 0.05)
                        Performance has improved.

filter i32 low selectivity                                                                             
                        time:   [1.3835 us 1.3858 us 1.3895 us]
                        change: [-41.772% -41.682% -41.570%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32      time:   [16.049 us 16.058 us 16.068 us]                                
                        change: [-93.552% -93.545% -93.535%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 high selectivity                                                                             
                        time:   [4.7336 us 4.7417 us 4.7507 us]
                        change: [-59.764% -59.459% -58.934%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 low selectivity                                                                            
                        time:   [141.69 ns 141.83 ns 141.98 ns]
                        change: [-83.970% -83.929% -83.895%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 w NULLs                                                                             
                        time:   [42.514 us 42.529 us 42.548 us]
                        change: [-86.620% -86.613% -86.606%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context i32 w NULLs high selectivity                                                                             
                        time:   [10.785 us 10.788 us 10.790 us]
                        change: [+0.8230% +1.0819% +1.2557%] (p = 0.00 < 0.05)
                        Change within noise threshold.

filter context i32 w NULLs low selectivity                                                                            
                        time:   [279.94 ns 280.04 ns 280.13 ns]
                        change: [-68.389% -68.317% -68.242%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs                                                                             
                        time:   [40.525 us 40.540 us 40.557 us]
                        change: [-87.204% -87.196% -87.187%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context u8 w NULLs high selectivity                                                                             
                        time:   [7.0863 us 7.0895 us 7.0928 us]
                        change: [+3.4115% +3.4779% +3.5553%] (p = 0.00 < 0.05)
                        Performance has regressed.

filter context u8 w NULLs low selectivity                                                                            
                        time:   [270.69 ns 270.79 ns 270.89 ns]
                        change: [-67.961% -67.906% -67.849%] (p = 0.00 < 0.05)
                        Performance has improved.

filter f32              time:   [120.99 us 121.18 us 121.56 us]                       
                        change: [-68.315% -68.281% -68.241%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32      time:   [36.935 us 36.951 us 36.968 us]                                
                        change: [-88.072% -88.059% -88.045%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32 high selectivity                                                                             
                        time:   [10.339 us 10.342 us 10.346 us]
                        change: [-1.7648% -1.4694% -1.2565%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context f32 low selectivity                                                                            
                        time:   [287.59 ns 288.40 ns 289.16 ns]
                        change: [-67.322% -67.087% -66.921%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string   time:   [243.62 us 243.66 us 243.72 us]                                  
                        change: [-40.960% -40.903% -40.858%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string high selectivity                                                                            
                        time:   [75.645 us 75.664 us 75.682 us]
                        change: [-81.225% -81.211% -81.199%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string low selectivity                                                                             
                        time:   [666.91 ns 667.33 ns 667.80 ns]
                        change: [-42.717% -42.635% -42.556%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary                                                                             
                        time:   [16.348 us 16.354 us 16.359 us]
                        change: [-89.690% -89.684% -89.677%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary high selectivity                                                                              
                        time:   [4.8464 us 4.8530 us 4.8617 us]
                        change: [-83.162% -83.106% -83.056%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary low selectivity                                                                            
                        time:   [318.97 ns 319.14 ns 319.32 ns]
                        change: [-49.861% -49.819% -49.770%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary w NULLs                                                                             
                        time:   [42.654 us 42.726 us 42.806 us]
                        change: [-87.775% -87.728% -87.657%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary w NULLs high selectivity                                                                             
                        time:   [11.313 us 11.316 us 11.319 us]
                        change: [-68.181% -68.152% -68.130%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string dictionary w NULLs low selectivity                                                                            
                        time:   [480.59 ns 480.81 ns 481.04 ns]
                        change: [-54.683% -54.630% -54.579%] (p = 0.00 < 0.05)
                        Performance has improved.

filter single record batch                                                                            
                        time:   [62.108 us 62.210 us 62.292 us]
                        change: [-80.890% -80.862% -80.837%] (p = 0.00 < 0.05)
                        Performance has improved.

tustvold · 2022-02-03T09:38:27Z

arrow/src/compute/kernels/filter.rs

+
+/// A builder to construct [`FilterPredicate`]
+#[derive(Debug)]
+pub struct FilterBuilder {


The idea behind this is that we may wish to give the user more control over how the predicate is executed, by creating a builder we have an obvious extension point for doing this

tustvold · 2022-02-03T09:40:18Z

arrow/src/compute/kernels/filter.rs

+/// used to build a new [`GenericStringArray`] by copying values from the source
+///
+/// TODO(raphael): Could this be used for the take kernel as well?
+struct FilterString<'a, OffsetSize> {


In subsequent PRs I intend to experiment with applying a similar construction to primitive arrays and null bitmasks, e.g. FilterPrimitive and FilterNull as I think it might compose nicely and potentially allow for code sharing with the take kernels.

alamb

This is very impressive work @tustvold 👏

I went through this code pretty carefully. It makes sense to me, though I had some comments / requests for clarification

Before merging this I would really like another pair of eyes on this, perhaps @jhorstmann @yordan-pavlov , @ritchie46 or @nevi-me would be willing to review 🙏

arrow/benches/filter_kernels.rs

arrow/src/util/bench_util.rs

alamb · 2022-02-03T16:43:50Z

arrow/src/compute/kernels/filter.rs

@@ -119,17 +147,83 @@ impl<'a> Iterator for SlicesIterator<'a> {
    }
 }

+/// An iterator of `usize` whose index in [`BooleanArray`] is true
+///
+/// This provides the best performance on all but the most selective predicates, where the


Suggested change

/// This provides the best performance on all but the most selective predicates, where the

/// This provides the best performance on all but the least selective predicates (which keep most / all rows), where the

I think "least selective" means "high selectivity", confusingly

Is it also worth mentioning that it requires the filter to be entirely non-null?

NULLs within the filter itself are handled by prep_null_mask_filter - i.e. they are converted to false prior to any of this code seeing it

my understanding of the meaning of "highly selective filter" is that it filters out most of the data

Yeah, highly selective means low selectivity. Selectivity is a measure of how many rows it selects, selective-ness is a measure of how few rows it selects. Or something like that... This terminology is still new to me 😄

arrow/src/compute/kernels/filter.rs

alamb · 2022-02-03T18:04:18Z

arrow/src/compute/kernels/filter.rs

+/// `filter` implementation for string arrays
+///
+/// Note: NULLs with a non-zero slot length in `array` will have the corresponding
+/// data copied across. This allows handling the null mask separately from the data


I thought this maybe was handled by applying prep_null_mask_filter to the filter first?

prep_null_mask_filter handles nulls in the filter, not nulls in the array being filtered.

arrow/src/compute/kernels/filter.rs

alamb · 2022-02-03T18:08:57Z

arrow/src/compute/kernels/filter.rs

@@ -696,7 +1273,8 @@ mod tests {
            .flat_map(|(idx, v)| v.then(|| idx))
            .collect();

-        assert_eq!(bits, expected_bits);
+        assert_eq!(slice_bits, expected_bits);
+        assert_eq!(index_bits, expected_bits);
    }

    #[test]


Do you feel good about the level of testing in this module (specifically with filters of selectivity 0.8 and higher / lower)?

Eeh... I'm fairly confident, but more tests couldn't hurt. I'll see what I can come up with

Added a fuzz test in d587a46

tustvold · 2022-02-04T22:02:57Z

Thank you all for taking the time to review this, I think I've incorporated all feedback, but let me know if I've missed something 😅

yordan-pavlov

thanks @tustvold, looks great

arrow/src/compute/kernels/filter.rs

Dandandan

Looks great! Very impressive improvements, can't wait to see this landed in DataFusion.

Dandandan · 2022-02-06T09:11:12Z

FYI @jorgecarleitao might be good to compare performance to arrow2 after those changes?

Dandandan · 2022-02-06T09:15:58Z

arrow/src/compute/kernels/filter.rs

+
+    /// Extends the in-progress array by the ranges in the provided iterator
+    fn extend_slices(&mut self, iter: impl Iterator<Item = (usize, usize)>) {
+        for slice in iter {


Suggested change

for slice in iter {

for (start, end) in iter {

Small style suggestion

jorgecarleitao · 2022-02-06T09:34:12Z

Last time I benched arrow2 was 5x faster than arrow (results here), so this most likely beats it :)

I haven't had the time to go through this in detail, but the approach is very well though. Kudos to @tustvold !

arrow/src/compute/kernels/filter.rs

jorgecarleitao

Ok, I finished going through this in detail. Thanks a lot @tustvold , quite powerful ideas here.

The IterationStrategy design of switching between slices and indexes in particular is quite strong, imo.

I do wonder how the performance diff is so dramatic for primitives. My initial hypothesis is that the perf is primary due to the selectivity split between index vs slice iteration, since I can't see any major differences in the slicing code.

💯

EDIT: Ahhh, I now understand, it is the specialization of the mutableArrayData that offers the performance. Makes total sense.

arrow/src/compute/kernels/filter.rs

sunchao · 2022-02-06T19:03:30Z

arrow/src/compute/kernels/filter.rs

-        // now we only have a boolean mask to deal with
-        let predicate = prep_null_mask_filter(predicate);
-        return filter(array, &predicate);
+pub fn filter(values: &dyn Array, predicate: &BooleanArray) -> Result<ArrayRef> {


Not quite related to this PR, but I'm wondering that in the case there are multiple predicates on the same input array, maybe it's beneficial to only materialize the final output array after all the predicates have been evaluated. It seems we don't have such API yet?

If you had multiple predicates, I presume you'd evaluate them separately and then AND the filters together to get the final filter to "materialize". I'm not sure if this is what you had in mind?

Yea that's what I have in mind. I'm thinking that for a chain of ANDs maybe we don't need to evaluate them separately but rather evaluate on the result of the previous one, and skip materialization only until the last predicate (or short-circuit if all values are filtered out before that)

Definitely something to measure. I think it would be hard to beat a strategy that materializes very selective predicates and otherwise just evaluates predicates against already "eliminated" rows. A masked comparison kernel, I'd expect to have non-trivial branching overheads, but I honestly don't know 😄 I guess it also depends on how expensive the predicate is to evaluate, a complex cast might tip the scales... 🤔

I don't know either. For the chained ANDs case, maybe in future it is worth to evaluate the approach based on selection vector and only materialize the output array at the end.

sunchao · 2022-02-06T19:11:59Z

arrow/src/compute/kernels/filter.rs

+            _ => prep_null_mask_filter(filter),
+        };
+
+        let count = filter_count(&filter);


For the special case where count is equal to the length of original array (i.e., all values are selected), is there a faster path?

Good shout - will add 👍

Added in 2aa46b0

What about the same for 0 (no selected values)?

This was actually being handled at a higher level in filter_array, I've cleaned this up 7b4a2f1

sunchao · 2022-02-06T19:13:28Z

Great work @tustvold ! really looking forward to this improvement!

tustvold · 2022-02-06T20:35:12Z

the specialization of the mutableArrayData that offers the performance

Yeah, setting FILTER_SLICES_SELECTIVITY_THRESHOLD to 0, i.e. forcing slice iteration, shows a roughly 2-3x performance improvement on low selectivity filters solely from moving away from MutableArrayData. This isn't all that surprising considering each range is 2 dynamic dispatches 😅 The remaining 5x comes from the iteration strategy.

viirya · 2022-02-07T21:13:24Z

arrow/src/compute/kernels/filter.rs

@@ -119,17 +155,83 @@ impl<'a> Iterator for SlicesIterator<'a> {
    }
 }

+/// An iterator of `usize` whose index in [`BooleanArray`] is true
+///
+/// This provides the best performance on all but the least selective predicates (which keep most


nvm if I read it wrongly. But doesn't "the least selective predicates" means keep only a few rows? I think here it wants to say is, the highly selective predicates should favours SlicesIterator, right?

I think a filter with high selectivity is considered to be not very selective, i.e. it isn't very discerning about what it selects. However, this terminology seems to be a source of a lot of confusion, myself included, so I'm just going to reword it 😆

viirya · 2022-02-07T21:13:49Z

arrow/src/compute/kernels/filter.rs

+/// slots of a [BooleanArray] are true. Each interval corresponds to a contiguous region of memory
+/// to be "taken" from an array to be filtered.
+///
+/// This is only performant for the least selective filters that copy across long contiguous runs


ditto, I think this should be highly selective filters?

viirya · 2022-02-07T21:27:48Z

arrow/src/compute/kernels/filter.rs

+                        let iter = SlicesIterator::new(&predicate.filter);
+                        iter.for_each(|(start, end)| mutable.extend(0, start, end));


Hmm, why using SlicesIterator covers all other cases, e.g., Indices here?

In short because MutableArrayData doesn't support other iteration strategies and I'm not sure that calling it with "ranges" of length 1 will outperform at least attempting to call it with larger ranges.

My 2 cents is that the eventual goal should be to remove this fallback, as MutableArrayData is a poor fit for filtering where the ranges are typically not substantial enough to amortize its per-range overheads effectively

viirya · 2022-02-07T21:33:11Z

A great improvement! Looking forward to have this enhancement.

alamb

Looks great to me; I think it is ready to go

I'll plan to merge to master tomorrow unless there are any more comments

Thank you to everyone who contributed ideas and discussions

alamb · 2022-02-08T19:44:42Z

arrow/src/compute/kernels/filter.rs

+        for i in 0..100 {
+            let filter_percent = match i {
+                0..=4 => 1.,
+                5..=10 => 0.,


TIL match works on ranges!

sunchao

LGTM

alamb · 2022-02-09T21:03:37Z

Thank you to everyone who contributed reviews and comments to this PR and thank you again to @tustvold for sticking with this!

github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Jan 30, 2022

tustvold commented Jan 30, 2022

View reviewed changes

tustvold force-pushed the specialized-filter-kernel branch from ab231cf to 8ed556e Compare January 30, 2022 00:12

yordan-pavlov reviewed Jan 31, 2022

View reviewed changes

tustvold added 7 commits February 2, 2022 22:30

Add specialized primitive filter kernels

3a05ebb

Filter context

ba5df9b

Optimize null buffer construction

f233909

Clippy

31d606a

Benchmark filter construction

272e7a6

Review feedback

19cc6ee

Specialized string filter

c5912ef

tustvold force-pushed the specialized-filter-kernel branch from 437e120 to c5912ef Compare February 2, 2022 22:30

Specialized dictionary filter kernel

3ed6b60

tustvold marked this pull request as ready for review February 2, 2022 23:33

tustvold changed the title ~~POC: Specialized filter kernels~~ Specialized filter kernels Feb 2, 2022

tustvold commented Feb 3, 2022

View reviewed changes

Use trusted_len_iter

17f3d80

alamb reviewed Feb 3, 2022

View reviewed changes

tustvold added 2 commits February 4, 2022 21:15

Review feedback

0918e83

Use i64 for chunk offset

1e7481a

yordan-pavlov approved these changes Feb 4, 2022

View reviewed changes

Dandandan reviewed Feb 5, 2022

View reviewed changes

arrow/src/compute/kernels/filter.rs Show resolved Hide resolved

Only optimize filter when filtering multiple columns

395f6b3

Dandandan approved these changes Feb 5, 2022

View reviewed changes

Dandandan reviewed Feb 6, 2022

View reviewed changes

jorgecarleitao reviewed Feb 6, 2022

View reviewed changes

arrow/src/compute/kernels/filter.rs Show resolved Hide resolved

jorgecarleitao approved these changes Feb 6, 2022

View reviewed changes

sunchao reviewed Feb 6, 2022

View reviewed changes

tustvold added 3 commits February 6, 2022 20:52

Test truncated filter

37fa7f2

Review feedback

2aa46b0

Add IterationStrategy::None

7b4a2f1

viirya reviewed Feb 7, 2022

View reviewed changes

Remove selective / selectivity docs confusion

9fc5fac

alamb approved these changes Feb 8, 2022

View reviewed changes

sunchao approved these changes Feb 8, 2022

View reviewed changes

alamb merged commit c064d53 into apache:master Feb 9, 2022

alamb added the performance label Feb 16, 2022

alamb changed the title ~~Specialized filter kernels~~ Add specialized filter kernels in compute module (up to 10x faster) Feb 16, 2022

tustvold mentioned this pull request Mar 17, 2024

Adaptive Parquet Predicate Pushdown #5523

Open


		let mut buffer = MutableBuffer::with_capacity(filter_count * T::get_byte_width());

		let selectivity_frac = filter_count as f64 / filter.len() as f64;

	/// This provides the best performance on all but the most selective predicates, where the
	/// This provides the best performance on all but the least selective predicates (which keep most / all rows), where the

		let iter = SlicesIterator::new(&predicate.filter);
		iter.for_each(\|(start, end)\| mutable.extend(0, start, end));

Add specialized filter kernels in compute module (up to 10x faster) #1248

Add specialized filter kernels in compute module (up to 10x faster) #1248

Conversation

tustvold commented Jan 30, 2022 • edited by alamb Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Jan 30, 2022 • edited Loading

Codecov Report

Dandandan commented Jan 30, 2022

tustvold commented Jan 30, 2022

Choose a reason for hiding this comment

yordan-pavlov commented Jan 31, 2022

tustvold commented Jan 31, 2022

jhorstmann commented Feb 1, 2022

jhorstmann commented Feb 1, 2022

tustvold commented Feb 1, 2022

tustvold commented Feb 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Feb 3, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold commented Feb 4, 2022

yordan-pavlov left a comment

Choose a reason for hiding this comment

Dandandan left a comment

Choose a reason for hiding this comment

Dandandan commented Feb 6, 2022 • edited Loading

Choose a reason for hiding this comment

jorgecarleitao commented Feb 6, 2022

jorgecarleitao left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sunchao commented Feb 6, 2022

tustvold commented Feb 6, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

viirya commented Feb 7, 2022

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sunchao left a comment

Choose a reason for hiding this comment

alamb commented Feb 9, 2022

Add specialized filter kernels in `compute` module (up to 10x faster) #1248

Add specialized filter kernels in `compute` module (up to 10x faster) #1248

tustvold commented Jan 30, 2022 •

edited by alamb

Loading

codecov-commenter commented Jan 30, 2022 •

edited

Loading

tustvold commented Feb 2, 2022 •

edited

Loading

tustvold Feb 3, 2022 •

edited

Loading

Dandandan commented Feb 6, 2022 •

edited

Loading

jorgecarleitao left a comment •

edited

Loading

tustvold Feb 7, 2022 •

edited

Loading