Add additional tests for InListExpr by adriangb · Pull Request #19050 · apache/datafusion

adriangb · 2025-12-02T18:12:18Z

This adds tests that demonstrate some bugs that are fixed in #18832, so we can level the playing field on benchmarks.

adriangb · 2025-12-02T18:12:50Z

I expect tests to fail. We can fix them once we confirm.

adriangb · 2025-12-03T17:38:41Z

datafusion/physical-expr/src/expressions/in_list.rs

+        downcast_dictionary_array! {
+            v => {
+                let values_contains = self.contains(v.values().as_ref(), negated)?;
+                let result = take(&values_contains, v.keys(), None)?;
+                return Ok(downcast_array(result.as_ref()))
+            }
+            _ => {}
+        }


The lack of this was a bug
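To illustrate the idea in the hunk above: evaluate the membership check once per distinct dictionary value, then gather the per-value answers by key, which is the role `take` plays in the diff. The sketch below is a stdlib-only toy and not the DataFusion code. `DictColumn`, `contains`, and `contains_dict` are hypothetical names, `Option` stands in for Arrow nulls, and the list is assumed to contain no nulls.

```rust
/// Toy stand-in for a dictionary-encoded column (assumption, not an Arrow type):
/// each row stores a key that indexes into `values`; a `None` key is a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<Option<i32>>,
}

/// Hypothetical membership check over plain values.
/// A null value yields a null result; the list itself has no nulls here.
fn contains(values: &[Option<i32>], list: &[i32], negated: bool) -> Vec<Option<bool>> {
    values
        .iter()
        .map(|v| v.map(|v| list.contains(&v) != negated))
        .collect()
}

/// Evaluate once per distinct dictionary value, then gather by key.
fn contains_dict(col: &DictColumn, list: &[i32], negated: bool) -> Vec<Option<bool>> {
    let per_value = contains(&col.values, list, negated);
    col.keys
        .iter()
        .map(|key| key.and_then(|k| per_value[k]))
        .collect()
}

fn main() {
    let col = DictColumn {
        keys: vec![Some(0), Some(2), None, Some(0), Some(1)],
        values: vec![Some(100), Some(200), Some(500)],
    };
    let result = contains_dict(&col, &[100, 500], false);
    println!("{result:?}");
}
```

The dictionary values here (100, 200, 500) are deliberately different from the keys (0, 1, 2), so an implementation that confused the two would produce visibly different output.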

adriangb · 2025-12-03T17:38:51Z

datafusion/physical-expr/src/expressions/in_list.rs

-        let result = match (v.null_count() > 0, negated) {
-            (true, false) => {
-                // has nulls, not negated"
-                BooleanArray::from_iter(
-                    v.iter().map(|value| Some(self.values.contains(&value?))),


We were not handling nulls properly

i double checked with postgres and the new code looks right

postgres=# select NULL IN (1);
 ?column?
----------

(1 row)

postgres=# select 1 IN (1, NULL);
 ?column?
----------
 t
(1 row)

postgres=# select 1 IN (2, NULL);
 ?column?
----------

(1 row)

postgres=# select 1 NOT IN (2, NULL);
 ?column?
----------

(1 row)

yep I checked postgres and duckdb to get the right behavior
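For reference, the four psql results in this thread follow from SQL three-valued logic. The following is a minimal stdlib-only sketch of those rules, not the `InListExpr` implementation. `sql_in` and `sql_not_in` are hypothetical helper names, and `None` models SQL NULL.

```rust
/// SQL `needle IN (haystack...)` under three-valued logic.
/// `None` models SQL NULL. Hypothetical helper, for illustration only.
fn sql_in(needle: Option<i32>, haystack: &[Option<i32>]) -> Option<bool> {
    let needle = needle?; // NULL IN (...) is NULL
    if haystack.contains(&Some(needle)) {
        Some(true) // a definite match wins, even if the list also has NULLs
    } else if haystack.contains(&None) {
        None // no match, but a NULL in the list might have matched
    } else {
        Some(false)
    }
}

/// `NOT IN` is the three-valued negation of `IN`: NULL stays NULL.
fn sql_not_in(needle: Option<i32>, haystack: &[Option<i32>]) -> Option<bool> {
    sql_in(needle, haystack).map(|b| !b)
}

fn main() {
    // The four cases from the psql session in this thread.
    assert_eq!(sql_in(None, &[Some(1)]), None);
    assert_eq!(sql_in(Some(1), &[Some(1), None]), Some(true));
    assert_eq!(sql_in(Some(1), &[Some(2), None]), None);
    assert_eq!(sql_not_in(Some(1), &[Some(2), None]), None);
}
```

The removed code wrapped every result in `Some(...)` after checking the needle, so a null in the haystack could never turn a non-match into NULL. That is the third and fourth case here.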

adriangb · 2025-12-08T18:12:29Z

@alamb I've added test harnesses to remove duplication as you requested 😄

alamb

Thank you @adriangb -- I went through the code fixes and tests and this PR looks like a definite improvement to me

I left some small suggestions

I also ran code coverage on this PR to double check and it did find one more case that is not covered

cargo llvm-cov test --html -p datafusion-physical-expr

alamb · 2025-12-09T11:32:47Z

datafusion/physical-expr/src/expressions/in_list.rs

+    /// This validates data types, evaluates the list as constants, and uses specialized
+    /// StaticFilter implementations for better performance (e.g., Int32StaticFilter for Int32).
+    ///
+    /// Returns an error if data types don't match or if the list contains non-constant expressions.


I don't think this comment is correct -- the code appears to handle the case of non-constant expressions as well

I think this might also be clearer if it were simply named try_new() rather than try_from_static_filter

thanks updated

alamb · 2025-12-09T12:03:53Z

datafusion/physical-expr/src/expressions/in_list.rs

-                BooleanArray::from_iter(
-                    v.iter().map(|value| Some(self.values.contains(&value?))),
-                )
+                // Either needle or haystack has nulls, not negated


Suggested change

-                // Either needle or haystack has nulls, not negated
+                // needle has nulls, not negated

alamb · 2025-12-09T12:05:40Z

datafusion/physical-expr/src/expressions/in_list.rs

-                BooleanArray::from_iter(
-                    v.iter().map(|value| Some(!self.values.contains(&value?))),
-                )
+                // Either needle or haystack has nulls, negated


Suggested change

-                // Either needle or haystack has nulls, negated
+                // needle has nulls, negated

alamb · 2025-12-09T12:08:03Z

datafusion/physical-expr/src/expressions/in_list.rs

+                negated,
+                Some(instantiate_static_filter(in_array)?),
+            )),
+            Err(_) => {


nit -- I think it would be clearer if this try_evaluate_constant_list evaluated to Result<Option<..>> rather than Result -- that way we avoid the string allocation on error, and could pass real errors through.

This code treats every error the same way, even when the cause is something other than an unsupported type

If you went this route, I suspect you wouldn't need the second error check for type matches (it would already be handled)

renamed and refactored to Result<Option<>> as you suggested
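The shape being discussed can be sketched as follows: `Ok(None)` means "nothing went wrong, but the fast path does not apply", while `Err` is reserved for real failures that the caller should propagate. This is a stdlib-only illustration under assumptions. The `Expr` and `EvalError` types and the function body are hypothetical, and only the name `try_evaluate_constant_list` comes from the review comment.

```rust
/// Hypothetical, heavily simplified expression type.
enum Expr {
    Literal(i64),
    Div(i64, i64), // a constant expression that can fail to evaluate
    Column,        // not a constant: needs per-row evaluation
}

#[derive(Debug, PartialEq)]
enum EvalError {
    Arithmetic, // division by zero or overflow
}

/// `Ok(Some(values))`: every item was constant and evaluated.
/// `Ok(None)`: some item is not constant, so the caller should fall back.
/// `Err(e)`: a genuine error that should not be swallowed.
fn try_evaluate_constant_list(list: &[Expr]) -> Result<Option<Vec<i64>>, EvalError> {
    let mut out = Vec::with_capacity(list.len());
    for expr in list {
        match expr {
            Expr::Literal(v) => out.push(*v),
            Expr::Div(a, b) => out.push(a.checked_div(*b).ok_or(EvalError::Arithmetic)?),
            Expr::Column => return Ok(None),
        }
    }
    Ok(Some(out))
}

/// Caller: `?` forwards real errors, `None` selects the generic path.
fn choose_path(list: &[Expr]) -> Result<&'static str, EvalError> {
    Ok(match try_evaluate_constant_list(list)? {
        Some(_) => "static filter",
        None => "generic evaluation",
    })
}

fn main() {
    println!("{:?}", choose_path(&[Expr::Literal(1), Expr::Div(6, 3)]));
    println!("{:?}", choose_path(&[Expr::Literal(1), Expr::Column]));
    println!("{:?}", choose_path(&[Expr::Div(1, 0)]));
}
```

With this shape the caller no longer needs an `Err(_) => { ... }` arm that treats every failure as "unsupported", and no error message has to be allocated just to signal a fallback.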

alamb · 2025-12-09T12:14:32Z

datafusion/physical-expr/src/expressions/in_list.rs

+        name: &'static str,
+        value_in: ScalarValue,
+        value_not_in: ScalarValue,
+        value_in_list: ScalarValue,


my reading of these tests is that they always test a one element list -- as in the list does not have multiple values

Do you think we need coverage for multi-value lists?

The previous tests all had multi-value lists, such as

// expression: "a not in ("a", "b")"

Maybe we could make this something like values_in: Vec<ScalarValue>

However, I see there are multi-value tests below too, so maybe this is ok

I updated to accept other_list_values: Vec<ScalarValue> and have the tests throw in some extra values to give coverage

alamb · 2025-12-09T12:22:10Z

datafusion/physical-expr/src/expressions/in_list.rs

-        let result = match (v.null_count() > 0, negated) {
-            (true, false) => {
-                // has nulls, not negated"
-                BooleanArray::from_iter(
-                    v.iter().map(|value| Some(self.values.contains(&value?))),


i double checked with postgres and the new code looks right

postgres=# select NULL IN (1);
 ?column?
----------

(1 row)

postgres=# select 1 IN (1, NULL);
 ?column?
----------
 t
(1 row)

postgres=# select 1 IN (2, NULL);
 ?column?
----------

(1 row)

postgres=# select 1 NOT IN (2, NULL);
 ?column?
----------

(1 row)

alamb · 2025-12-09T12:24:05Z

datafusion/physical-expr/src/expressions/in_list.rs

+        // Create dictionary-encoded batch with values [1, 2, 5]
+        // Dictionary: keys [0, 1, 2] -> values [1, 2, 5]
+        let keys = Int8Array::from(vec![0, 1, 2]);
+        let values = Int32Array::from(vec![1, 2, 5]);


I personally recommend using values that are clearly not the keys - for example

Suggested change

-        let values = Int32Array::from(vec![1, 2, 5]);
+        let values = Int32Array::from(vec![100, 200, 500]);

(you would also have to change the literals above)

alamb · 2025-12-09T12:27:18Z

datafusion/physical-expr/src/expressions/in_list.rs

            }
            (false, true) => {
-                // no null, negated
+                // No nulls anywhere, negated


according to code coverage, this branch is not covered by tests (see PR comments)

alamb · 2025-12-09T12:28:35Z

datafusion/physical-expr/src/expressions/in_list.rs

+                RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::clone(&array)])?;
+
+            // Helper to format SQL-like representation for error messages
+            let _format_sql = |negated: bool, with_null: bool| -> String {


this helper seems unused (why is it named starting with _ 🤔 )

it was a pipe dream to have nice error messages. I updated it and implemented it. now if a test fails you get something like:

assertion `left == right` failed: Failed for: a IN (5, 3, 7)
a: PrimitiveArray<Int32>
[
  1,
  3,
  4,
]
  left: BooleanArray
[
  true,
  true,
  true,
]
 right: BooleanArray
[
  false,
  true,
  false,
]
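A helper of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the helper from the PR: the closure in the diff takes `(negated, with_null)`, whereas this hypothetical `format_sql` takes the column name and the list directly, with `None` standing in for NULL.

```rust
/// Hypothetical helper: render an IN-list check as SQL-like text for use
/// in assertion messages. `None` entries are printed as NULL.
fn format_sql(column: &str, list: &[Option<i32>], negated: bool) -> String {
    let items: Vec<String> = list
        .iter()
        .map(|v| match v {
            Some(v) => v.to_string(),
            None => "NULL".to_string(),
        })
        .collect();
    let op = if negated { "NOT IN" } else { "IN" };
    format!("{column} {op} ({})", items.join(", "))
}

fn main() {
    let list = [Some(5), Some(3), Some(7)];
    let sql = format_sql("a", &list, false);

    // Stand-ins for the evaluated and expected boolean results.
    let actual = vec![false, true, false];
    let expected = vec![false, true, false];

    // On failure, the custom message names the expression under test.
    assert_eq!(actual, expected, "Failed for: {sql}");
    println!("{sql}");
}
```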

github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 2, 2025

adriangb mentioned this pull request Dec 2, 2025

add specialized InList implementations for common scalar types #18832

Merged

adriangb commented Dec 3, 2025


adriangb force-pushed the add-in-list-tests branch 3 times, most recently from 76fd945 to f2eb513 Compare December 8, 2025 17:54

adriangb mentioned this pull request Dec 8, 2025

Short InList Optimization pydantic/datafusion#46

Merged

adriangb force-pushed the add-in-list-tests branch from 772bca6 to b3fca6a Compare December 9, 2025 03:00

alamb mentioned this pull request Dec 9, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-12-08 #19210

Closed

40 tasks

alamb approved these changes Dec 9, 2025


adriangb added 14 commits December 9, 2025 07:10

add additional in-list tests:

1f3e113

refactor and show bugs

15f4193

refactor

b4b23ed

fixes

c0ce5ac

remove comment

e381c9e

lint

5d63247

Add a test harness for primivite types

b89ed19

further consolidate

71d747e

refactor helpers

1878aaa

add dictionary tests

0ef15f2

more dict test cases

488b16f

include test name in errors

f13c308

lint, apply apache#18832 (comment)

7a8b4ce

add null testing, address pr feedback

89234b1

adriangb force-pushed the add-in-list-tests branch from b3fca6a to 89234b1 Compare December 9, 2025 15:02

fix lint

6a499ae

adriangb added this pull request to the merge queue Dec 9, 2025

Merged via the queue into apache:main with commit 21a16e4 Dec 9, 2025
31 checks passed

adriangb deleted the add-in-list-tests branch December 9, 2025 17:18

2 participants