Conversation
- Add LargeStringArray benchmarks alongside existing StringArray benchmarks - Use explicit ScalarValue::Utf8 for StringArray (was using ScalarValue::from which creates Utf8View)
… collect_bool The previous implementation used BooleanArray::from_iter and BooleanBufferBuilder with element-by-element appends, which incur iterator overhead and prevent vectorization. This commit switches to BooleanBuffer::collect_bool, a batch operation that pre-allocates the exact buffer size and enables SIMD optimization. Since collect_bool guarantees the index is always in bounds, we can safely use unchecked array access (value_unchecked, get_unchecked) to eliminate bounds checks in the hot loop. The null-handling match is also simplified from a 3-way tuple to a 2-way check by pre-combining needle and haystack null flags.
…ings For small IN lists (≤8 elements), hashing overhead dominates execution time. This commit uses binary search instead, which is faster for small lists. Utf8View gains a short-string filter that compares raw u128 views directly - the same layout Arrow uses for inline storage (≤12 bytes). This turns string comparison into fast integer comparison. Lists with long strings fall through to the generic hash-based filter. Benchmarks show significant improvement for Utf8View short strings and primitives with small lists.
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| use arrow::array::{Array, ArrayRef, Float32Array, Int32Array, StringArray}; |
There was a problem hiding this comment.
Could you make the benchmarks as a PR to datafusion/main so that we can merge them and have them in main for comparsion?
adriangb
left a comment
There was a problem hiding this comment.
Looks good to me! Great work. Let's merge this into my PR whenever you are ready. If you have the time, could you review apache#19050? It has more tests + fixes bugs in the current implementation. I'd also like to merge that before we do more perf optimization.
I'll give it a look tomorrow morning. Seems you fixed quite a few bugs! Nulls in arrays/lists are always tricky to get right... Feel free to merge my PR whenever you want. Once we've merged the updated benchmark, you can rebase over main to have a clean history and bench baseline. |
Which issue does this PR close?
Rationale for this change
The
INlist evaluation is a hot path in query execution. Profiling revealed two optimization opportunities:Hashing overhead dominates for small lists: For lists with ≤8 elements, the cost of hashing exceeds the benefit. Binary search is faster in this regime.
String comparison is expensive: For
Utf8Viewarrays, Arrow stores short strings (≤12 bytes) inline as a 128-bit view. We can compare these views directly as integers instead of doing byte-by-byte string comparison.What changes are included in this PR?
Small list optimization (≤8 elements):
SortedLookup<T>using binary search instead ofHashedLookup<T>for primitive types when list size ≤8Utf8View short-string optimization:
Utf8ViewSortedFilterandUtf8ViewHashedFilterthat convert strings to their rawu128view representationAre these changes tested?
Yes, covered by existing
in_listtests which exercise all data types, list sizes, and null percentages.Are there any user-facing changes?
No API changes. Queries using
INlists will execute faster, especially for:Utf8Viewcolumns with short stringsBenchmark Results
Benchmarks run on 1024-row arrays with varying list sizes, null percentages, and string lengths.
Results are only on the last commit of the PR.
Utf8View Short Strings (≤12 bytes) — 60-70% faster
Primitives with Small Lists — 30-65% faster
Regressions (large lists, long strings)
These regressions are in less common patterns (large lists with long strings) and are outweighed by the gains in typical use cases.