Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support StringView for binary operators like ~, !~, etc #12180

Closed
Tracked by #11752
alamb opened this issue Aug 26, 2024 · 4 comments
Closed
Tracked by #11752

Support StringView for binary operators like ~, !~, etc #12180

alamb opened this issue Aug 26, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Aug 26, 2024

Is your feature request related to a problem or challenge?

Part of #11752

As we work to complete StringView support in DataFusion @2010YOUY01 noticed on #11752 (comment) that we don't currently support Regexp like binary operators https://datafusion.apache.org/user-guide/sql/operators.html#op-re-match for string view

Reproducer

CREATE TABLE t0(v0 DOUBLE, v1 DOUBLE, v2 BOOLEAN, v3 BOOLEAN, v4 BOOLEAN, v5 STRING);
INSERT INTO t0(v1, v5, v2) VALUES (0.7183242196192607, 'Tn', true);
CREATE TABLE t0_stringview AS SELECT v0, v1, v2, v3, v4, arrow_cast(v5, 'Utf8View') as v5 FROM t0;
> select v5 ~ 'foo' from t0_stringview;
Internal error: Data type Utf8View not supported for binary_string_array_flag_op_scalar operation 'regexp_is_match' on string array.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
> select regexp_match(v5, 'foo') from t0_stringview;
+--------------------------------------------+
| regexp_match(t0_stringview.v5,Utf8("foo")) |
+--------------------------------------------+
|                                            |
+--------------------------------------------+
1 row(s) fetched.
Elapsed 0.034 seconds.

Describe the solution you'd like

StringView should be supported for these operators (aka the query should run without error)

Describe alternatives you've considered

Here are the relevant operator names:

            | Operator::RegexMatch
            | Operator::RegexIMatch
            | Operator::RegexNotMatch
            | Operator::RegexNotIMatch
            | Operator::LikeMatch
            | Operator::ILikeMatch
            | Operator::NotLikeMatch
            | Operator::NotILikeMatch

Here is the dispatch code:

RegexMatch => {
binary_string_array_flag_op!(left, right, regexp_is_match, false, false)
}
RegexIMatch => {
binary_string_array_flag_op!(left, right, regexp_is_match, false, true)
}
RegexNotMatch => {
binary_string_array_flag_op!(left, right, regexp_is_match, true, false)
}
RegexNotIMatch => {
binary_string_array_flag_op!(left, right, regexp_is_match, true, true)
}

It appears that the corresponding arrow-rs kernel does not yet have support for StringView
https://docs.rs/arrow-string/52.2.0/src/arrow_string/regexp.rs.html#307-311

So what I would suggest is:

  1. Implement a PR in datafusion with coercion from Utf8View --> Utf8 (aka cast arguments back to string)
  2. File an upstream ticket in arrow-rs for supporting string view with the regexp_like kernels and leave a link to that ticket in the datafusion code

Additional context

No response

@alamb alamb added the enhancement New feature or request label Aug 26, 2024
@alamb alamb changed the title Support StringView for bianry operators like ~, !~, etc Support StringView for binary operators like ~, !~, etc Aug 26, 2024
@tlm365
Copy link
Contributor

tlm365 commented Aug 27, 2024

Looks like #12168 also might be a starting point to solve this problem. I also think about solution 2 when handling that PR, and IMO solution 2 is the better option. I will take this one.

File an upstream ticket in arrow-rs for supporting string view with the regexp_like kernels and leave a link to that ticket in the datafusion code

@tlm365
Copy link
Contributor

tlm365 commented Aug 27, 2024

take

@alamb
Copy link
Contributor Author

alamb commented Aug 27, 2024

I also think about solution 2 when handling that PR, and IMO solution 2 is the better option.

I agree solution 2 is better -- it just will take longer as it needs two coordinated PRs. One thing we have done in the past is do the initial implementation in DataFusion, and then file a ticket / port the code upstream to arrow-rs. Once the code is release in arrow-rs and available in Datafusion we remove the copy in DataFusion

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

I think this ticket is done now (though there is some follow up work to improve performance / port the code upstream)

Thanks again @tlm365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants
@alamb @tlm365 and others