Implement native support StringViewArray for `regexp_is_match` and `regexp_is_match_scalar` function, deprecate `regexp_is_match_utf8` and `regexp_is_match_utf8_scalar` #6376

tlm365 · 2024-09-10T01:25:10Z

Which issue does this PR close?

Closes #6370.

Rationale for this change

Natively operate on StringViewArray without having to convert first to StringArray
(Potentially) take advantage of the new string view layout

What changes are included in this PR?

Introduce regexp_is_match and regexp_is_match_scalar, which can replace regexp_is_match_utf8 and regexp_is_match_utf8_scalar and can operate on StringArray / LargeStringArray / StringViewArray arguments.

Are there any user-facing changes?

No.

alamb · 2024-09-10T11:17:10Z

Thanks @tlm365 ❤️

I am running the benchmarks on this PR now and will report back when they are complete

alamb

Thank you @tlm365 -- this is looking really nice

I wonder if you might also be willing to add StringView to the benchmarks as well, specifically

arrow-rs/arrow/benches/comparison_kernels.rs

Lines 56 to 62 in 704f90b

    fn bench_regexp_is_match_utf8_scalar(arr_a: &StringArray, value_b: &str) {
        regexp_is_match_utf8_scalar(
            criterion::black_box(arr_a),
            criterion::black_box(value_b),
            None,
        )
        .unwrap();

So that if this code is changed in the future we can ensure it doesn't regress in performance

alamb · 2024-09-10T11:18:18Z

arrow-string/src/regexp.rs

+    );
+    test_flag_utf8!(
+        test_utf8_array_regexp_is_match_insensitive_2,
+        StringViewArray::from(vec!["arrow", "arrow", "arrow", "arrow", "arrow", "arrow"]),


StringViewArray has special case handling for strings that are more than 12 bytes long (the string data is stored out of band in those cases)

Can you please add tests that have some strings that are longer than 12 bytes?

Can you please add tests that have some strings that are longer than 12 bytes?

Yes, noted. I will review and update test cases for this scenario.
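
For readers unfamiliar with the 12-byte threshold mentioned above, here is a minimal std-only sketch of the idea, assuming the 16-byte view layout of the Arrow columnar format (strings of at most 12 bytes live inside the view; longer ones keep a 4-byte prefix plus a buffer index and offset). The `View` enum and the `make_view` / `read_view` helpers are hypothetical names for illustration and are not the arrow-rs types.

```rust
/// Maximum number of bytes stored inline in a view (assumed from the
/// Arrow columnar format's variable-size binary view layout).
const MAX_INLINE_LEN: usize = 12;

/// Simplified, hypothetical model of one view; not the arrow-rs type.
#[derive(Debug, PartialEq)]
enum View {
    /// Length plus up to 12 bytes of data stored directly in the view.
    Inline { len: u32, data: [u8; 12] },
    /// Length, 4-byte prefix, and a (buffer index, offset) into a data buffer.
    OutOfBand { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Builds a view for `s`, appending long strings to `buffer` (buffer 0).
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= MAX_INLINE_LEN {
        // Short string: copy the bytes into the view itself, zero padded.
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        // Long string: store the bytes out of band and remember where.
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::OutOfBand { len, prefix, buffer_index: 0, offset }
    }
}

/// Reads the string back, following the out-of-band pointer when needed.
fn read_view<'a>(view: &'a View, buffer: &'a [u8]) -> &'a str {
    let bytes = match view {
        View::Inline { len, data } => &data[..*len as usize],
        View::OutOfBand { len, offset, .. } => {
            let start = *offset as usize;
            &buffer[start..start + *len as usize]
        }
    };
    std::str::from_utf8(bytes).expect("views are built from valid UTF-8")
}
```

Because the two cases take different code paths, a test input such as "arrow" (5 bytes) only exercises the inline branch, which is why the review asks for strings longer than 12 bytes as well.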

alamb · 2024-09-10T11:18:54Z

arrow-string/src/regexp.rs

 /// See the documentation on [`regexp_is_match_utf8`] for more details.
-pub fn regexp_is_match_utf8_scalar<OffsetSize: OffsetSizeTrait>(
-    array: &GenericStringArray<OffsetSize>,
+pub fn regexp_is_match_utf8_scalar<'a, S>(


Unfortunately, I think this is an API change (as is the above)

I have an idea of how to update this PR to avoid an API change -- the reason this is important is that a breaking API change would need to wait until the next major release (Dec 2024) per the release schedule: https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule

TLDR is I think if we introduced a new function like the following:

    fn regexp_is_match(
        array: &dyn Array,
        regex_array: &dyn Array,
        flags_array: Option<&dyn Array>,
    ) -> Result<BooleanArray, ArrowError> { .. }

We could then support StringView and StringArray and LargeStringArray

TLDR is I think if we introduced a new function like the following:

@alamb Sounds good 👍 But why do we use &dyn Array for the new regexp_is_match function instead of keeping the current implementation?

Or am I misunderstanding you? I understand that we will provide a new regexp_is_match function, and mark the current regexp_is_match_utf8 function as:

    #[deprecated(since="54.0.0", note="please use `regexp_is_match` instead")]
    pub fn regexp_is_match_utf8(...) { ... }

Is that right? 🤔

tlm365 · 2024-09-10T13:38:09Z

I wonder if you might also be willing to add StringView to the benchmarks as well, specifically
So that if this code is changed in the future we can ensure it doesn't regress in performance

@alamb Thanks for reviewing. I am happy to add a benchmark for this one and will update the PR soon.


alamb · 2024-09-11T15:29:59Z

Here are the benchmark results (i.e., this PR doesn't slow down the existing implementation)

++ critcmp master regex-is-match-utf8
group                                                     master                                 regex-is-match-utf8
-----                                                     ------                                 -------------------
regexp_matches_utf8 scalar ends with                      1.02  1932.3±20.07µs        ? ?/sec    1.00  1898.4±17.40µs        ? ?/sec
regexp_matches_utf8 scalar starts with                    1.00  1932.1±14.07µs        ? ?/sec    1.00  1924.9±26.17µs        ? ?/sec


alamb · 2024-09-18T20:15:34Z

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more


alamb

Thank you very much @tlm365 -- this looks great.

I was reviewing this PR and I had the code checked out locally, so I took the liberty of making a few changes:

I fixed clippy (was failing due to using deprecated functions)
I updated the comments / added an example to ease the transition (passing None as the flags argument results in type inference errors without some type help)
Improved the comments in general making it clearer what regexp_is_match does and how it is related to regexp_match
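
The type inference problem mentioned in the second item can be reproduced without arrow at all. The following is a minimal std-only sketch, assuming a function whose optional flags argument has its own generic type parameter; `is_match` and `match_without_flags` are hypothetical stand-ins (using substring search rather than a regex), not the arrow-rs functions.

```rust
/// Hypothetical stand-in for a kernel whose optional third argument has its
/// own generic type `F`, independent of the type of the values.
fn is_match<S, F>(values: &[S], needle: &str, flags: Option<&[F]>) -> Vec<bool>
where
    S: AsRef<str>,
    F: AsRef<str>,
{
    values
        .iter()
        .enumerate()
        .map(|(i, value)| {
            let value: &str = value.as_ref();
            // Look up the per-row flag, if a flags slice was supplied.
            let flag: Option<&str> = flags.and_then(|f| f.get(i)).map(|f| f.as_ref());
            if flag == Some("i") {
                // "i" requests a case-insensitive comparison for this row.
                value.to_lowercase().contains(&needle.to_lowercase())
            } else {
                value.contains(needle)
            }
        })
        .collect()
}

/// Calls `is_match` with no flags.
fn match_without_flags(values: &[&str], needle: &str) -> Vec<bool> {
    // Writing `is_match(values, needle, None)` would not compile: nothing
    // ties `F` to a concrete type, so rustc asks for type annotations.
    // Giving the `None` an explicit type resolves the ambiguity.
    let flags: Option<&[&str]> = None;
    is_match(values, needle, flags)
}
```

The same trick applies to any generic optional argument: annotate the binding that holds `None` so the compiler has a concrete type to work with.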

alamb · 2024-09-18T21:52:10Z

arrow-string/src/regexp.rs

+/// * [`regexp_is_match_scalar`] for matching a single regular expression against an array of strings
+/// * [`regexp_match`] for extracting groups from a string array based on a regular expression
+///
+/// # Example


I added an example to help the migration

alamb · 2024-09-18T21:52:45Z

arrow-string/src/regexp.rs

    regex_array: &GenericStringArray<OffsetSize>,
    flags_array: Option<&GenericStringArray<OffsetSize>>,
 ) -> Result<BooleanArray, ArrowError> {
+    regexp_is_match(array, regex_array, flags_array)


I switched the implementation to just call the new function to avoid duplication
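
The delegation pattern described in this comment can be sketched in isolation as follows. This is an illustrative std-only example under stated assumptions: `is_match_scalar` and `is_match_scalar_utf8` are hypothetical names, substring search stands in for regex matching, and the `since` version is a placeholder.

```rust
/// New, more general entry point (hypothetical): accepts any slice of
/// string-like values.
pub fn is_match_scalar<S: AsRef<str>>(values: &[S], needle: &str) -> Vec<bool> {
    values
        .iter()
        .map(|v| {
            let s: &str = v.as_ref();
            s.contains(needle)
        })
        .collect()
}

/// Old entry point kept for compatibility. Its signature is unchanged, so
/// existing callers still compile; the body simply forwards to the new
/// function so there is only one implementation to maintain.
#[deprecated(since = "1.2.0", note = "please use `is_match_scalar` instead")]
pub fn is_match_scalar_utf8(values: &[String], needle: &str) -> Vec<bool> {
    is_match_scalar(values, needle)
}

/// Helper showing that both entry points return the same result.
/// `#[allow(deprecated)]` silences the warning for this intentional call.
#[allow(deprecated)]
pub fn old_and_new_agree(values: &[String], needle: &str) -> bool {
    is_match_scalar_utf8(values, needle) == is_match_scalar(values, needle)
}
```

Keeping the old signature intact is what lets a change like this ship without being a breaking API change, while the deprecation warning nudges callers toward the new function.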

alamb · 2024-09-19T15:31:26Z

@tlm365 I wonder if you have a few minutes to review the changes I pushed to this PR.

Again, I am sorry about the review delays

tlm365 · 2024-09-19T17:15:47Z

@tlm365 I wonder if you have a few minutes to review the changes I pushed to this PR.

Again, I am sorry about the review delays

@alamb Oops, thank you so much for reviewing. Sorry 🙇 I've been a little busy lately. Noted and will come back to review this weekend.

tlm365 · 2024-09-21T08:53:57Z

Thank you very much @tlm365 -- this looks great.

I was reviewing this PR and I had the code checked out locally, so I took the liberty of making a few changes:

I fixed clippy (was failing due to using deprecated functions)

I updated the comments / added an example to ease the transition (passing None as the flags argument results in type inference errors without some type help)

Improved the comments in general making it clearer what regexp_is_match does and how it is related to regexp_match

@alamb it looks very nice 👍 thank you so much for this update! ❤️

Dandandan · 2024-09-21T20:44:58Z

Thanks @tlm365 and @alamb

alamb · 2024-10-02T18:29:49Z

We updated this PR so that it is no longer an API change, so I am removing the label

Implement native support StringViewArray for regex_is_match function

5088c2c

github-actions bot added the arrow Changes to the arrow crate label Sep 10, 2024

alamb reviewed Sep 10, 2024


alamb added the api-change Changes to the arrow API label Sep 10, 2024

tlm365 marked this pull request as draft September 11, 2024 02:35

Update test cases cover StringViewArray length more then 12 bytes

595d64c

alamb mentioned this pull request Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 2, 2024 apache/datafusion#12336

Closed

4 tasks

Add StringView benchmark for regexp_is_match

e80deea

Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>

tlm365 force-pushed the regex-is-match-utf8 branch from 514847f to e80deea Compare September 11, 2024 11:05

alamb mentioned this pull request Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 9, 2024 apache/datafusion#12391

Closed

5 tasks

alamb mentioned this pull request Sep 16, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 apache/datafusion#12494

Closed

8 tasks

tlm365 force-pushed the regex-is-match-utf8 branch from c4763a9 to 65e6839 Compare September 17, 2024 17:05

tlm365 changed the title ~~Implement native support StringViewArray for regexp_is_match_utf8 and regexp_is_match_utf8_scalar function~~ Implement native support StringViewArray for regexp_is_match and regexp_is_match_scalar function Sep 17, 2024

Implement native support StringViewArray for regex_is_match function

4fed56b

Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>

tlm365 force-pushed the regex-is-match-utf8 branch from 65e6839 to 4fed56b Compare September 18, 2024 10:04

tlm365 marked this pull request as ready for review September 18, 2024 10:05

Remove duplicate implementation, fix clippy, add docs

e10b474


alamb approved these changes Sep 18, 2024


Dandandan approved these changes Sep 21, 2024


Dandandan merged commit d05cf6d into apache:master Sep 21, 2024
24 checks passed

alamb mentioned this pull request Oct 2, 2024

Upgrade arrow/parquet to 53.1.0 / fix clippy apache/datafusion#12724

Merged

alamb removed the api-change Changes to the arrow API label Oct 2, 2024

tlm365 deleted the regex-is-match-utf8 branch November 10, 2024 15:33

Omega359 mentioned this pull request Nov 11, 2024

regexp_match does not support Utf8View apache/datafusion#13357

Closed

tlm365 mentioned this pull request Nov 17, 2024

Remove redundant implementation of StringArrayType #6743

Merged

