Short InList Optimization by geoffreyclaude · Pull Request #46 · pydantic/datafusion

geoffreyclaude · 2025-12-08T18:49:41Z

Which issue does this PR close?

Closes #.

Rationale for this change

The IN list evaluation is a hot path in query execution. Profiling revealed two optimization opportunities:

Hashing overhead dominates for small lists: For lists with ≤8 elements, the cost of hashing exceeds the benefit. Binary search is faster in this regime.
String comparison is expensive: For Utf8View arrays, Arrow stores short strings (≤12 bytes) inline as a 128-bit view. We can compare these views directly as integers instead of doing byte-by-byte string comparison.

What changes are included in this PR?

Small list optimization (≤8 elements):

Introduces SortedLookup<T> using binary search instead of HashedLookup<T> for primitive types when list size ≤8
Separate filter structs ensure static dispatch (no runtime branch in the hot loop)
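
The two points above can be sketched roughly as follows. This is a minimal std-only illustration, not the code from this PR: the `Lookup` trait, the `probe` and `in_list` helpers, and the `SMALL_LIST_MAX` constant are hypothetical names, and `SortedLookup`/`HashedLookup` here are simplified stand-ins that only borrow the names used in the description. The strategy is chosen once, outside the loop, and the generic `probe` function is monomorphized per lookup type, so the per-row loop contains no size check.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Assumed cutoff, taken from the "≤8 elements" figure in the description.
const SMALL_LIST_MAX: usize = 8;

/// Hypothetical membership interface shared by both strategies.
pub trait Lookup<T> {
    fn contains(&self, needle: &T) -> bool;
}

/// Simplified stand-in: a sorted, deduplicated vector probed by binary search.
pub struct SortedLookup<T> {
    values: Vec<T>,
}

/// Simplified stand-in: a hash set probed by hashing the needle.
pub struct HashedLookup<T> {
    values: HashSet<T>,
}

impl<T: Ord> Lookup<T> for SortedLookup<T> {
    #[inline]
    fn contains(&self, needle: &T) -> bool {
        self.values.binary_search(needle).is_ok()
    }
}

impl<T: Eq + Hash> Lookup<T> for HashedLookup<T> {
    #[inline]
    fn contains(&self, needle: &T) -> bool {
        self.values.contains(needle)
    }
}

/// Compiled separately for each `L`, so the loop body has no strategy branch.
fn probe<T, L: Lookup<T>>(lookup: &L, column: &[T]) -> Vec<bool> {
    column.iter().map(|v| lookup.contains(v)).collect()
}

/// Picks the strategy once per call, then runs the specialized loop.
pub fn in_list<T: Ord + Hash + Copy>(column: &[T], list: &[T]) -> Vec<bool> {
    if list.len() <= SMALL_LIST_MAX {
        let mut values = list.to_vec();
        values.sort_unstable();
        values.dedup();
        probe(&SortedLookup { values }, column)
    } else {
        let values: HashSet<T> = list.iter().copied().collect();
        probe(&HashedLookup { values }, column)
    }
}
```

A real filter would build the lookup once per expression rather than once per call, and floating-point types would need a total order (for example `f32::total_cmp` or a comparison on bit patterns) because `f32` does not implement `Ord`.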

Utf8View short-string optimization:

New Utf8ViewSortedFilter and Utf8ViewHashedFilter that convert strings to their raw u128 view representation
Strings ≤12 bytes become fast integer comparisons
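
To illustrate the idea, here is a small std-only sketch that packs a short string into a `u128` the way an inline view is laid out (a 4-byte little-endian length followed by up to 12 zero-padded data bytes) and then tests membership with integer comparisons. The helper names `inline_view`, `build_views`, and `short_in_list` are hypothetical and do not come from the PR; the real filters read the views directly from the Arrow array instead of re-encoding strings.

```rust
/// Encodes a string of at most 12 bytes as a 128-bit inline view:
/// bytes 0..4 hold the length (little-endian), bytes 4..16 hold the data,
/// zero-padded. Returns `None` for longer strings, which are not inline.
pub fn inline_view(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > 12 {
        return None;
    }
    let mut buf = [0u8; 16];
    buf[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    buf[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(buf))
}

/// Builds a sorted, deduplicated list of views. Returns `None` if any list
/// element is longer than 12 bytes, in which case a caller would fall back
/// to a generic string filter.
pub fn build_views(list: &[&str]) -> Option<Vec<u128>> {
    let mut views = list
        .iter()
        .map(|s| inline_view(s))
        .collect::<Option<Vec<u128>>>()?;
    views.sort_unstable();
    views.dedup();
    Some(views)
}

/// Membership test using only integer comparisons. A needle longer than
/// 12 bytes cannot equal any short list element, so it is never a match.
pub fn short_in_list(needle: &str, sorted_views: &[u128]) -> bool {
    match inline_view(needle) {
        Some(view) => sorted_views.binary_search(&view).is_ok(),
        None => false,
    }
}
```

The integer order of the views is not the lexicographic order of the strings, but binary search only needs a consistent total order. Because the length is part of the view, zero padding cannot make two different strings compare equal.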

Are these changes tested?

Yes. The changes are covered by the existing in_list tests, which exercise all data types, list sizes, and null percentages.

Are there any user-facing changes?

No API changes. Queries using IN lists will execute faster, especially for:

Utf8View columns with short strings
Primitive columns with small IN lists
Columns with null values

Benchmark Results

Benchmarks run on 1024-row arrays with varying list sizes, null percentages, and string lengths.

Results cover only the last commit of the PR.

Utf8View Short Strings (≤12 bytes) — 60-70% faster

Benchmark	Before	After	Change
list=3/nulls=0%/str=3	6.87 µs	2.18 µs	-68%
list=3/nulls=0%/str=12	7.00 µs	2.19 µs	-69%
list=8/nulls=0%/str=3	7.30 µs	2.79 µs	-62%
list=8/nulls=20%/str=12	6.73 µs	2.92 µs	-57%
list=100/nulls=0%/str=3	7.25 µs	2.47 µs	-66%
list=100/nulls=20%/str=12	7.38 µs	2.57 µs	-66%

Primitives with Small Lists — 30-65% faster

Benchmark	Before	After	Change
Float32/list=3/nulls=0%	5.40 µs	2.20 µs	-59%
Float32/list=8/nulls=0%	5.36 µs	2.94 µs	-46%
Int32/list=3/nulls=0%	2.05 µs	1.41 µs	-32%
Int32/list=3/nulls=20%	4.30 µs	1.56 µs	-64%
Int32/list=8/nulls=20%	4.35 µs	2.25 µs	-48%
Int32/list=100/nulls=20%	4.56 µs	2.04 µs	-56%

Regressions (large lists, long strings)

Benchmark	Before	After	Change
Utf8View/list=100/str=100	12.89 µs	13.95 µs	+8%
Float32/list=100/nulls=0%	5.55 µs	5.69 µs	+3%

These regressions appear only with large lists (100 elements), combined with long strings in the Utf8View case. These patterns are less common, and the regressions are outweighed by the gains in typical use cases.

… collect_bool

The previous implementation used BooleanArray::from_iter and BooleanBufferBuilder with element-by-element appends, which incur iterator overhead and prevent vectorization. This commit switches to BooleanBuffer::collect_bool, a batch operation that pre-allocates the exact buffer size and enables SIMD optimization.

Since collect_bool guarantees the index is always in bounds, we can safely use unchecked array access (value_unchecked, get_unchecked) to eliminate bounds checks in the hot loop. The null-handling match is also simplified from a 3-way tuple to a 2-way check by pre-combining needle and haystack null flags.
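
The batch approach described in this commit message can be illustrated with a std-only sketch. It does not call Arrow: `collect_bool_packed` is a hypothetical stand-in for the idea behind `BooleanBuffer::collect_bool` (allocate the packed buffer once, then fill it 64 results at a time from an index-based closure), and `in_list_bitmap` is a hypothetical caller. The bit order (least significant bit first) is an assumption of this sketch, and null handling is left out.

```rust
/// Packs `len` booleans produced by `f(i)` into 64-bit words, least
/// significant bit first. The closure is only ever called with `i < len`.
pub fn collect_bool_packed<F: FnMut(usize) -> bool>(len: usize, mut f: F) -> Vec<u64> {
    // Allocate the exact number of words up front instead of growing per element.
    let mut words = Vec::with_capacity((len + 63) / 64);
    for chunk_start in (0..len).step_by(64) {
        let chunk_end = (chunk_start + 64).min(len);
        let mut packed = 0u64;
        for i in chunk_start..chunk_end {
            packed |= (f(i) as u64) << (i - chunk_start);
        }
        words.push(packed);
    }
    words
}

/// Evaluates `column[i] IN sorted_list` for every row and returns the
/// packed result bitmap. `sorted_list` must be sorted in ascending order.
pub fn in_list_bitmap(column: &[i32], sorted_list: &[i32]) -> Vec<u64> {
    collect_bool_packed(column.len(), |i| {
        // SAFETY: `collect_bool_packed` only passes indices in `0..column.len()`,
        // so the unchecked access stays in bounds.
        let value = unsafe { *column.get_unchecked(i) };
        sorted_list.binary_search(&value).is_ok()
    })
}
```

The unchecked access is sound here only because the packing function never hands out an index at or beyond `len`, which mirrors the in-bounds guarantee the commit message relies on.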

…ings

For small IN lists (≤8 elements), hashing overhead dominates execution time. This commit uses binary search instead, which is faster for small lists.

Utf8View gains a short-string filter that compares raw u128 views directly, using the same layout Arrow uses for inline storage (≤12 bytes). This turns string comparison into fast integer comparison. Lists with long strings fall through to the generic hash-based filter.

Benchmarks show significant improvement for Utf8View short strings and primitives with small lists.

adriangb · 2025-12-08T18:58:48Z

datafusion/physical-expr/benches/in_list.rs

 // specific language governing permissions and limitations
 // under the License.

-use arrow::array::{Array, ArrayRef, Float32Array, Int32Array, StringArray};


Could you open the benchmarks as a PR against datafusion/main so that we can merge them and have them in main for comparison?

Nvm I see this is apache#19211 😄

I opened it here already :) apache#19211

adriangb

Looks good to me! Great work. Let's merge this into my PR whenever you are ready. If you have the time, could you review apache#19050? It has more tests + fixes bugs in the current implementation. I'd also like to merge that before we do more perf optimization.

geoffreyclaude · 2025-12-08T19:05:35Z

> Looks good to me! Great work. Let's merge this into my PR whenever you are ready. If you have the time, could you review apache#19050? It has more tests + fixes bugs in the current implementation. I'd also like to merge that before we do more perf optimization.

I'll give it a look tomorrow morning. Seems you fixed quite a few bugs! Nulls in arrays/lists are always tricky to get right...

Feel free to merge my PR whenever you want. Once we've merged the updated benchmark, you can rebase over main to have a clean history and bench baseline.

geoffreyclaude added 4 commits December 8, 2025 16:54

fix: inverted null_percent logic in in_list benchmark

b57f493

bench: add Utf8 and LargeUtf8 benchmarks for InList

f1f064b

- Add LargeStringArray benchmarks alongside existing StringArray benchmarks
- Use explicit ScalarValue::Utf8 for StringArray (was using ScalarValue::from which creates Utf8View)

github-actions bot added the physical-expr label Dec 8, 2025

geoffreyclaude mentioned this pull request Dec 8, 2025

add specialized InList implementations for common scalar types apache/datafusion#18832

Merged

adriangb marked this pull request as ready for review December 8, 2025 18:58

adriangb reviewed Dec 8, 2025

adriangb approved these changes Dec 8, 2025

adriangb merged commit d299a91 into pydantic:specialize Dec 8, 2025
4 checks passed

adriangb added a commit that referenced this pull request Dec 9, 2025

Revert "Short InList Optimization (#46)"

d8f6d45

This reverts commit d299a91.

adriangb merged 4 commits into pydantic:specialize from geoffreyclaude:perf/in_list
