add specialized InList implementations for common scalar types#18832
add specialized InList implementations for common scalar types#18832adriangb merged 7 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds specialized StaticFilter implementations for common scalar types to optimize IN LIST operations in DataFusion. Previously, only Int32 had a specialized filter; now Int8, Int16, Int64, UInt8, UInt16, UInt32, UInt64, Boolean, Utf8, LargeUtf8, Utf8View, Binary, LargeBinary, and BinaryView all have optimized implementations.
Key changes:
- Introduced two macros (
primitive_static_filter!anddefine_static_filter!) to generate specialized filter implementations, eliminating code duplication - Extended
instantiate_static_filterto route 15 additional data types to their specialized implementations - Refactored the
in_listfunction to useinstantiate_static_filterinstead of defaulting to the genericArrayStaticFilter
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@alamb maybe let's run benchmarks? |
|
Here's what I'm seeing so far:
I think we'd need to add benchmarks for other primitive types. |
|
Thinking about it the trick is probably to avoid the extra |
| } | ||
| (false, false, false) => { | ||
| // no nulls anywhere, not negated | ||
| BooleanArray::from_iter( |
There was a problem hiding this comment.
BooleanBuffer::collect_bool is faster
There was a problem hiding this comment.
Thank you I know you or some other reviewer had pointed this out to me before. I am making a mental note to try to not forget again and keep an eye out for it. Thanks for you patience.
There was a problem hiding this comment.
I do wonder why we don't have faster high-level APIs if this is really important. E.g. BooleanArray::new_false, BooleanArray::new_nulls, BooleanArray::new_true and BooleanArray::collect_bool(size, iterator) or something like that.
There was a problem hiding this comment.
We have been discussing various improvements:
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
|
🤖 |
|
🤖: Benchmark completed Details
|
alamb
left a comment
There was a problem hiding this comment.
Thanks @adriangb -- this seems like an improvement to me
It would be nice if we could reduce some of the duplication in the tests, but I don't think that is a deal breaker 👍
I do think we should cover the no null cases with tests
Do you also plan to make special InList implementation for Utf8/Utf8View/LargeUtf8?
| } | ||
|
|
||
| #[test] | ||
| fn in_list_int8() -> Result<()> { |
There was a problem hiding this comment.
Can we please reduce the duplication in tests here? It seems like we there are like 16 copies of the same test
Reducing the duplication will make it easier to understand what is being covered
| BooleanArray::new(builder.finish(), None) | ||
| } | ||
| (false, false, true) => { | ||
| let values = v.values(); |
There was a problem hiding this comment.
This code appears to be uncovered by tests. I tested using
cargo llvm-cov test --html -p datafusion-physical-expr --lib -- in_lis
Here is the whole report in case that is useful llvm-cov.zip
| } | ||
| fn contains(&self, v: &dyn Array, negated: bool) -> Result<BooleanArray> { | ||
| // Handle dictionary arrays by recursing on the values | ||
| downcast_dictionary_array! { |
There was a problem hiding this comment.
I didn't see any tests for dictionaries 🤔
| } | ||
| (false, false, false) => { | ||
| // no nulls anywhere, not negated | ||
| BooleanArray::from_iter( |
There was a problem hiding this comment.
We have been discussing various improvements:
| } | ||
|
|
||
| #[test] | ||
| fn in_list_utf8_view() -> Result<()> { |
There was a problem hiding this comment.
this PR has tests for utf8 but no changes for those types. Is that your intention?
| } | ||
| (true, _, true) | (false, true, true) => { | ||
| // Either needle or haystack has nulls, negated | ||
| BooleanArray::from_iter(v.iter().map(|value| { |
There was a problem hiding this comment.
It probably would be faster to handle the nulls separately or using set_indices rather than using BooleanArray::from_iter and v.iter etc.
| let values = v.values(); | ||
| let mut builder = BooleanBufferBuilder::new(values.len()); | ||
| for value in values.iter() { | ||
| builder.append(self.values.contains(value)); |
There was a problem hiding this comment.
This unfortunately is slower than collect_bool. I see there is some good discussion on better APIs on apache/arrow-rs#8561
Dandandan
left a comment
There was a problem hiding this comment.
Nice to get some performance back.
Results seem a bit mixed? |
|
Yes. Slowdowns for i32 are concerning. I won’t merge this until it’s all speedups or neutral. I may also make a support PR to add more benchmarks for other types so we can make better comparisons. |
|
I am hoping to look at this PR today or tomorrow and help push it along |
|
I realized this new implementation differs from the existing one in ways which fix some bugs. I opened #19050 to demonstrate those bugs. So maybe lets first fix the bugs and merge the new tests, then we can continue here? |
| let haystack_has_nulls = self.null_count > 0; | ||
|
|
||
| let result = match (v.null_count() > 0, haystack_has_nulls, negated) { | ||
| (true, _, false) | (false, true, false) => { |
There was a problem hiding this comment.
You can simplify this slightly complex 6 boolean match with:
let has_nulls = v.null_count() > 0 || haystack_has_nulls;
match (has_nulls, negated) {
(true, false) => { /* Either needle or haystack has nulls, not negated */ }
(true, true) => { /* Either needle or haystack has nulls, negated */ }
(false, false) => { /* no nulls anywhere, not negated */ }
(false, true) => { /* no nulls anywhere, negated */ }
}|
General comment on the benchmark but... am I reading them wrong, or is the fn do_benches(
c: &mut Criterion,
array_length: usize,
in_list_length: usize,
null_percent: f64,
) {
...
let values: Int32Array = (0..array_length)
.map(|_| rng.random_bool(null_percent).then(|| rng.random()))
.collect();
let in_list: Vec<_> = (0..in_list_length)
.map(|_| ScalarValue::Int32(Some(rng.random())))
.collect();
do_bench(
c,
&format!("in_list_i32 ({array_length}, {null_percent}) IN ({in_list_length}, 0)"),
Arc::new(values),
&in_list,
)When I think it should rather be something like: fn do_benches(
c: &mut Criterion,
array_length: usize,
in_list_length: usize,
null_percent: f64,
) {
let non_null_percent = 1.0 - null_percent;
...
let values: Int32Array = (0..array_length)
.map(|_| rng.random_bool(non_null_percent).then(|| rng.random()))
.collect();
let in_list: Vec<_> = (0..in_list_length)
.map(|_| ScalarValue::Int32(Some(rng.random())))
.collect();
do_bench(
c,
&format!("in_list_i32 ({array_length}, {null_percent}) IN ({in_list_length}, 0)"),
Arc::new(values),
&in_list,
)EDIT: I opened #19204 to fix this. |
|
🤖: Benchmark completed Details
|
|
Seems reproducible 😄 |
|
run benchmark in_list |
|
🤖 |
|
🤖: Benchmark completed Details
|
| .ok_or_else(|| exec_datafusion_err!("Failed to downcast array"))?; | ||
| impl PartialEq for OrderedFloat32 { | ||
| fn eq(&self, other: &Self) -> bool { | ||
| self.0.total_cmp(&other.0).is_eq() |
There was a problem hiding this comment.
Doesn't equality on floats use to_bits() elsewhere? This could lead to inconsistencies for edge cases (NaN, +0, -0 for instance.)
There was a problem hiding this comment.
I added tests for all of those and cross references against postgres/duckdb. But I will try to_bits().
| needle_nulls.cloned() | ||
| } | ||
| (false, true) => { | ||
| // Only haystack has nulls - null where not-in-set |
There was a problem hiding this comment.
This comment block is a bit verbose and not super clear. Maybe having it in table form would help?
There was a problem hiding this comment.
Thanks, this is way more readable!
|
run benchmark in_list |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
The benchmarks are now looking very good, and this has gone through several rounds of review, all of it addressed. But there are significant changes since the last approvals. @geoffreyclaude could you give another review and approve if you think it's ready so we can merge this piece and continue work in #19241 |
The tests you added as a dedicated PR give a lot of confidence this isn't introducing any functional regression. And perf-wise, then benchmarks speak for themselves! Do we have more generic benches that exercise this path? Maybe the Clickbench ones? Would be nice to see the big picture and prove this isn't "benchmaxing" with unexpected adverse effects in real life. Otherwise, ✅ from me of course. You addressed my remaining two nits already. |
|
I'll kick off some general benchmarks and if those look good and there's no more feedback I'll merge later today or tomorrow morning. |
|
run benchmark clickbench_partitioned tpch |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Benchmarks look good to me! |
| DataType::UInt64 => Ok(Arc::new(UInt64StaticFilter::try_new(&in_array)?)), | ||
| // Float primitive types (use ordered wrappers for Hash/Eq) | ||
| DataType::Float32 => Ok(Arc::new(Float32StaticFilter::try_new(&in_array)?)), | ||
| DataType::Float64 => Ok(Arc::new(Float64StaticFilter::try_new(&in_array)?)), |
There was a problem hiding this comment.
Do we know if adding an optimized version of the binary / string type comparisons is tracked with a ticket?
There was a problem hiding this comment.
No but I don't think that was optimized before and there's less opportunity to optimize because of the copy price. I'm going to leave that for #19241 where there's already a lot of great ideas if that's okay

Closes #18824