Fix string view LIKE checks with NULL values #6662

findepi · 2024-10-31T15:25:09Z

Which issue does this PR close?

for Incorrect LIKE and ILIKE result for NULL input, `%', and '%%' pattern datafusion#12637 (comment)

Rationale for this change

Fix correctness

What changes are included in this PR?

Fix StringView LIKE .. result when tested values include nulls.

Are there any user-facing changes?

Yes

findepi · 2024-10-31T15:41:59Z

cc @goldmedal

alamb

Thanks @findepi -- this is looking good. Nice find.

alamb · 2024-11-01T16:10:06Z

arrow-string/src/predicate.rs

-                            .map(|haystack| finder.find(haystack).is_some() != negate)
+                            .enumerate()
+                            .map(|(idx, haystack)| {
+                                if nulls.as_ref().map(|n| n.is_null(idx)).unwrap_or_default() {


my concern with this approach is that it adds code to the inner loop of the kernel.

We should definitely run benchmarks on this PR before merging

I think another way to solve this would be to either:

Have two loops -- one when nulls was Some and the other when nulls was None (so we aren't checking each row)

Leave the computation as is (don't check for nulls on the input) but instead simply copy the Nulls array from the input to the output (so any input that was null would also be null in the output)

FWIW I am not sure why this isn't using from_unary like the other codepaths which would provide option 2, and be significantly more efficient than first collecting into a Vec...

For now i am more concerned about correctness. Producing fast but wrong results isn't as useful.
i changed contains to to use from_unary, that seemed trivially good.
@tustvold might this be it doesn't use from_unary because it wants to use prefix_bytes_iter and suffix_bytes_iter? Please advise! I think we can switch all operations to from_unary to fix correctness and leave tuning performance to a follow-up.

I am running some benchmarks to see what, if any, impact this has in performance

cargo bench --bench comparison_kernels -- like

I will report back shortly

ilike_utf8 scalar complex 1.00 2.7±0.05ms ? ?/sec 1.03 2.8±0.09ms ? ?/sec ilike_utf8 scalar contains 1.01 4.2±0.05ms ? ?/sec 1.00 4.2±0.07ms ? ?/sec ilike_utf8 scalar ends with 1.02 1279.8±38.11µs ? ?/sec 1.00 1258.4±37.50µs ? ?/sec ilike_utf8 scalar equals 1.03 801.1±23.96µs ? ?/sec 1.00 775.4±20.05µs ? ?/sec ilike_utf8 scalar starts with 1.00 1165.9±38.61µs ? ?/sec 1.02 1193.0±54.84µs ? ?/sec ilike_utf8_scalar_dyn dictionary[10] string[4]) 1.00 77.7±0.09µs ? ?/sec 1.00 77.6±0.07µs ? ?/sec like_utf8 scalar complex 1.00 1924.8±43.83µs ? ?/sec 1.00 1924.7±49.77µs ? ?/sec like_utf8 scalar contains 1.00 1562.1±14.30µs ? ?/sec 1.00 1567.9±23.23µs ? ?/sec like_utf8 scalar ends with 1.02 432.7±11.70µs ? ?/sec 1.00 425.2±5.10µs ? ?/sec like_utf8 scalar equals 1.16 127.0±0.17µs ? ?/sec 1.00 109.6±0.20µs ? ?/sec like_utf8 scalar starts with 1.00 339.2±4.42µs ? ?/sec 1.02 347.4±15.72µs ? ?/sec like_utf8_scalar_dyn dictionary[10] string[4]) 1.00 77.6±0.20µs ? ?/sec 1.00 77.4±0.12µs ? ?/sec like_utf8view scalar complex 1.00 182.8±0.58ms ? ?/sec 1.01 184.9±0.67ms ? ?/sec like_utf8view scalar contains 1.02 141.3±0.34ms ? ?/sec 1.00 138.2±0.32ms ? ?/sec like_utf8view scalar ends with 13 bytes 1.49 66.4±0.42ms ? ?/sec 1.00 44.7±0.27ms ? ?/sec like_utf8view scalar ends with 4 bytes 1.45 66.1±0.33ms ? ?/sec 1.00 45.4±0.35ms ? ?/sec like_utf8view scalar ends with 6 bytes 1.46 66.0±0.35ms ? ?/sec 1.00 45.2±0.32ms ? ?/sec like_utf8view scalar equals 1.00 33.9±0.12ms ? ?/sec 1.11 37.8±0.10ms ? ?/sec like_utf8view scalar starts with 13 bytes 1.55 68.6±0.36ms ? ?/sec 1.00 44.2±0.33ms ? ?/sec like_utf8view scalar starts with 4 bytes 1.57 42.6±0.18ms ? ?/sec 1.00 27.2±0.13ms ? ?/sec like_utf8view scalar starts with 6 bytes 1.53 69.0±0.32ms ? ?/sec 1.00 45.2±0.28ms ? ?/sec nilike_utf8 scalar complex 1.00 2.7±0.07ms ? ?/sec 1.00 2.7±0.06ms ? ?/sec nilike_utf8 scalar contains 1.02 4.2±0.05ms ? ?/sec 1.00 4.1±0.05ms ? ?/sec nilike_utf8 scalar ends with 1.01 1259.7±41.60µs ? ?/sec 1.00 1247.7±26.01µs ? ?/sec nilike_utf8 scalar equals 1.05 801.3±25.80µs ? ?/sec 1.00 765.8±17.79µs ? ?/sec nilike_utf8 scalar starts with 1.02 1171.4±33.96µs ? ?/sec 1.00 1146.6±33.64µs ? ?/sec nlike_utf8 scalar complex 1.00 1918.4±31.09µs ? ?/sec 1.00 1924.4±43.22µs ? ?/sec nlike_utf8 scalar contains 1.00 1566.2±19.53µs ? ?/sec 1.00 1566.7±16.01µs ? ?/sec nlike_utf8 scalar ends with 1.00 428.5±5.39µs ? ?/sec 1.02 438.5±14.26µs ? ?/sec nlike_utf8 scalar equals 1.16 127.0±0.15µs ? ?/sec 1.00 109.5±0.12µs ? ?/sec nlike_utf8 scalar starts with 1.07 359.3±21.26µs ? ?/sec 1.00 337.2±4.81µs ? ?/sec

It looks like there are some non trivial regressions (50%) or utf8view with scalar starts/ends.

I did run benchmarks locally too. Thanks @alamb for the instructions.

First, on the commit 1621350 where the PR branch starts

git checkout 16213501769371c7d2d6ce299755337b6db4d8fd caffeinate -ds cargo bench --bench comparison_kernels -- --save-baseline main like

then on 1e82884, the tip of the PR branch

git checkout findepi/fix-string-view-like-checks-with-null-values-1121b8 caffeinate -ds cargo bench --bench comparison_kernels -- --baseline main like

This command runs bechmark and produces comparison. Let me omit low changes (below 9%) for brevity, as well as outliers reporting.

significant performance regressions:

like_utf8view scalar ends with 4 bytes time: [36.885 ms 36.925 ms 36.972 ms] change: [+77.568% +77.857% +78.161%] (p = 0.00 < 0.05) Performance has regressed. like_utf8view scalar ends with 6 bytes time: [36.761 ms 36.818 ms 36.881 ms] change: [+77.369% +77.776% +78.195%] (p = 0.00 < 0.05) Performance has regressed. like_utf8view scalar ends with 13 bytes time: [36.537 ms 36.613 ms 36.694 ms] change: [+78.308% +79.019% +79.752%] (p = 0.00 < 0.05) Performance has regressed. like_utf8view scalar starts with 4 bytes time: [25.490 ms 25.542 ms 25.596 ms] change: [+112.76% +113.25% +113.76%] (p = 0.00 < 0.05) Performance has regressed. like_utf8view scalar starts with 6 bytes time: [36.556 ms 36.598 ms 36.645 ms] change: [+84.943% +85.578% +86.205%] (p = 0.00 < 0.05) Performance has regressed. like_utf8view scalar starts with 13 bytes time: [37.065 ms 37.105 ms 37.152 ms] change: [+86.735% +87.424% +88.116%] (p = 0.00 < 0.05) Performance has regressed.

significant performance improvements:

ilike_utf8 scalar equals time: [475.56 µs 476.17 µs 476.93 µs] change: [-47.498% -47.406% -47.310%] (p = 0.00 < 0.05) Performance has improved. ilike_utf8 scalar contains time: [2.4461 ms 2.4500 ms 2.4552 ms] change: [-15.221% -15.008% -14.786%] (p = 0.00 < 0.05) Performance has improved. ilike_utf8 scalar ends with time: [546.82 µs 547.76 µs 548.86 µs] change: [-43.736% -43.605% -43.466%] (p = 0.00 < 0.05) Performance has improved. ilike_utf8 scalar starts with time: [560.40 µs 561.46 µs 562.67 µs] change: [-43.349% -42.997% -42.641%] (p = 0.00 < 0.05) Performance has improved. Benchmarking ilike_utf8 scalar complex: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s, enable flat sampling, or reduce sample count to 60. ilike_utf8 scalar complex time: [1.0629 ms 1.0637 ms 1.0649 ms] change: [-29.556% -29.299% -29.062%] (p = 0.00 < 0.05) Performance has improved. nilike_utf8 scalar equals time: [475.34 µs 476.23 µs 477.67 µs] change: [-46.418% -46.259% -46.081%] (p = 0.00 < 0.05) Performance has improved. nilike_utf8 scalar contains time: [2.4427 ms 2.4449 ms 2.4477 ms] change: [-15.899% -15.421% -15.120%] (p = 0.00 < 0.05) Performance has improved. nilike_utf8 scalar ends with time: [548.31 µs 548.96 µs 549.63 µs] change: [-42.924% -42.742% -42.554%] (p = 0.00 < 0.05) Performance has improved. nilike_utf8 scalar starts with time: [557.88 µs 558.54 µs 559.35 µs] change: [-42.683% -42.392% -42.144%] (p = 0.00 < 0.05) Performance has improved. Benchmarking nilike_utf8 scalar complex: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s, enable flat sampling, or reduce sample count to 60. nilike_utf8 scalar complex time: [1.0648 ms 1.0669 ms 1.0696 ms] change: [-28.834% -28.600% -28.353%] (p = 0.00 < 0.05) Performance has improved.

TL;DR
i must be doing something wrong here. the benchmarks report improvement for code path i didn't change.

the results are apparently repeatable on my laptop

okay, easy to fix, actually

pushed performance-oriented changes
i still observe a regression for one case

like_utf8view scalar starts with 4 bytes time: [12.899 ms 12.939 ms 12.986 ms] change: [+7.7003% +8.0298% +8.4320%] (p = 0.00 < 0.05) Performance has regressed.

and still see improvements for many ilike cases e.g.

nilike_utf8 scalar ends with time: [547.64 µs 548.55 µs 549.63 µs] change: [-42.656% -42.257% -41.801%] (p = 0.00 < 0.05) Performance has improved.

i don't think i know how to move it further from here. @alamb @tustvold ptal

Checking it out

from_unary takes care of nulls

findepi · 2024-11-08T14:22:58Z

cc @Dandandan @crepererum

alamb

Thank you @findepi -- this looks good to me. I am also rerunning the benchmarks to be sure

alamb · 2024-11-08T16:57:12Z

arrow-string/src/predicate.rs

@@ -146,6 +138,7 @@ impl<'a> Predicate<'a> {
            }
            Predicate::IStartsWithAscii(v) => {
                if let Some(string_view_array) = array.as_any().downcast_ref::<StringViewArray>() {
+                    // TODO respect null buffer


These are good todos -- I will file a ticket to handle them

i plan to take care of this -- for apache/datafusion#12637

i created #6705 for this

alamb · 2024-11-08T16:58:53Z

Here are my benchmark results. My conclusion is that there is some non trivial variability in the benchmarks but I don't think this PR does anything substantial

group                                              findepi_fix-string-view-like-checks-with-null-values-1121b8    master
-----                                              -----------------------------------------------------------    ------
ilike_utf8 scalar complex                          1.03      2.8±0.07ms        ? ?/sec                            1.00      2.7±0.06ms        ? ?/sec
ilike_utf8 scalar contains                         1.04      4.3±0.06ms        ? ?/sec                            1.00      4.1±0.05ms        ? ?/sec
ilike_utf8 scalar ends with                        1.00  1255.7±49.78µs        ? ?/sec                            1.00  1250.3±33.44µs        ? ?/sec
ilike_utf8 scalar equals                           1.00   729.4±23.72µs        ? ?/sec                            1.07   781.1±24.73µs        ? ?/sec
ilike_utf8 scalar starts with                      1.01  1151.3±46.67µs        ? ?/sec                            1.00  1144.6±35.48µs        ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4])    1.00     77.9±0.30µs        ? ?/sec                            1.00     77.6±0.07µs        ? ?/sec
like_utf8 scalar complex                           1.02  1952.8±24.11µs        ? ?/sec                            1.00  1909.3±20.85µs        ? ?/sec
like_utf8 scalar contains                          1.10  1728.7±21.70µs        ? ?/sec                            1.00  1565.7±13.97µs        ? ?/sec
like_utf8 scalar ends with                         1.00    421.4±9.37µs        ? ?/sec                            1.01    424.1±7.08µs        ? ?/sec
like_utf8 scalar equals                            1.00    109.7±0.29µs        ? ?/sec                            1.16    127.1±0.33µs        ? ?/sec
like_utf8 scalar starts with                       1.00   345.5±12.23µs        ? ?/sec                            1.00   346.2±11.37µs        ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4])     1.00     77.6±0.22µs        ? ?/sec                            1.00     77.5±0.10µs        ? ?/sec
like_utf8view scalar complex                       1.06    195.4±0.68ms        ? ?/sec                            1.00    183.9±0.71ms        ? ?/sec
like_utf8view scalar contains                      1.07    148.5±0.33ms        ? ?/sec                            1.00    138.8±0.20ms        ? ?/sec
like_utf8view scalar ends with 13 bytes            1.00     43.0±0.30ms        ? ?/sec                            1.11     47.7±0.17ms        ? ?/sec
like_utf8view scalar ends with 4 bytes             1.00     43.7±0.30ms        ? ?/sec                            1.11     48.7±0.12ms        ? ?/sec
like_utf8view scalar ends with 6 bytes             1.00     43.5±0.29ms        ? ?/sec                            1.11     48.5±0.16ms        ? ?/sec
like_utf8view scalar equals                        1.00     33.6±0.20ms        ? ?/sec                            1.07     35.8±0.07ms        ? ?/sec
like_utf8view scalar starts with 13 bytes          1.00     46.6±0.28ms        ? ?/sec                            1.01     47.2±0.15ms        ? ?/sec
like_utf8view scalar starts with 4 bytes           1.00     33.2±0.11ms        ? ?/sec                            1.03     34.3±0.18ms        ? ?/sec
like_utf8view scalar starts with 6 bytes           1.00     47.1±0.24ms        ? ?/sec                            1.02     47.8±0.21ms        ? ?/sec
nilike_utf8 scalar complex                         1.06      2.9±0.07ms        ? ?/sec                            1.00      2.7±0.06ms        ? ?/sec
nilike_utf8 scalar contains                        1.04      4.3±0.07ms        ? ?/sec                            1.00      4.1±0.05ms        ? ?/sec
nilike_utf8 scalar ends with                       1.00  1228.8±29.01µs        ? ?/sec                            1.00  1229.4±21.51µs        ? ?/sec
nilike_utf8 scalar equals                          1.00   735.4±24.25µs        ? ?/sec                            1.05   773.3±12.76µs        ? ?/sec
nilike_utf8 scalar starts with                     1.05  1174.5±40.82µs        ? ?/sec                            1.00  1121.1±17.17µs        ? ?/sec
nlike_utf8 scalar complex                          1.02  1950.7±31.49µs        ? ?/sec                            1.00  1907.6±24.54µs        ? ?/sec
nlike_utf8 scalar contains                         1.10  1728.3±15.41µs        ? ?/sec                            1.00  1568.6±15.88µs        ? ?/sec
nlike_utf8 scalar ends with                        1.00    420.2±8.39µs        ? ?/sec                            1.00    421.9±7.39µs        ? ?/sec
nlike_utf8 scalar equals                           1.00    109.6±0.28µs        ? ?/sec                            1.16    126.9±0.13µs        ? ?/sec
nlike_utf8 scalar starts with                      1.00    344.2±6.52µs        ? ?/sec                            1.03   355.1±17.11µs        ? ?/sec

* Fix string view LIKE checks with NULL values * Add TODO comments to operations yet to be fixed * Use from_unary for contains with string view from_unary takes care of nulls * Fix performance * fixup! Fix performance

Fix string view LIKE checks with NULL values

d3e42a5

github-actions bot added the arrow Changes to the arrow crate label Oct 31, 2024

findepi mentioned this pull request Oct 31, 2024

Incorrect LIKE and ILIKE result for NULL input, `%', and '%%' pattern apache/datafusion#12637

Open

alamb reviewed Nov 1, 2024

View reviewed changes

findepi added 2 commits November 1, 2024 21:37

Add TODO comments to operations yet to be fixed

65d52ff

Use from_unary for contains with string view

1e82884

from_unary takes care of nulls

findepi force-pushed the findepi/fix-string-view-like-checks-with-null-values-1121b8 branch from d92d94a to 1e82884 Compare November 1, 2024 20:42

findepi requested a review from tustvold November 1, 2024 21:24

findepi mentioned this pull request Nov 2, 2024

Support Utf8View for bit_length kernel #6671

Merged

findepi requested a review from alamb November 3, 2024 20:37

findepi added 2 commits November 5, 2024 22:53

Fix performance

bede692

fixup! Fix performance

2394f92

alamb approved these changes Nov 8, 2024

View reviewed changes

alamb merged commit 12ca3f0 into apache:master Nov 8, 2024
18 checks passed

findepi deleted the findepi/fix-string-view-like-checks-with-null-values-1121b8 branch November 8, 2024 17:31

This was referenced Nov 8, 2024

Fix LIKE with escapes #6703

Merged

Fix string view ILIKE checks with NULL values #6705

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix string view LIKE checks with NULL values #6662

Fix string view LIKE checks with NULL values #6662

findepi commented Oct 31, 2024

findepi commented Oct 31, 2024

alamb left a comment

alamb Nov 1, 2024

tustvold Nov 1, 2024 •

edited

Loading

findepi Nov 1, 2024

alamb Nov 5, 2024

alamb Nov 5, 2024

findepi Nov 5, 2024

findepi Nov 5, 2024

findepi Nov 5, 2024

findepi Nov 5, 2024 •

edited

Loading

alamb Nov 8, 2024

findepi commented Nov 8, 2024

alamb left a comment

alamb Nov 8, 2024

findepi Nov 8, 2024

findepi Nov 8, 2024

alamb Nov 8, 2024

alamb commented Nov 8, 2024

Fix string view LIKE checks with NULL values #6662

Fix string view LIKE checks with NULL values #6662

Conversation

findepi commented Oct 31, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

findepi commented Oct 31, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Nov 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

findepi Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

findepi commented Nov 8, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Nov 8, 2024

tustvold Nov 1, 2024 •

edited

Loading

findepi Nov 5, 2024 •

edited

Loading