-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] support regex_instr #36958
base: main
Are you sure you want to change the base?
Conversation
|
Conflicts: be/test/exprs/string_fn_test.cpp
This resolves #19616 |
MySQL > SELECT regexp_instr('abCdE', 'ABC'); | ||
+--------------------------------------------------------+ | ||
| regexp_instr('abCdE', 'AbC'); | | ||
+--------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change 'ABC' to 'AbC' ?
be/test/exprs/string_fn_test.cpp
Outdated
std::string strs[] = {"Alex test", " Alexx test", "AAAAAlex test", "AAlex Alex", "adssdaAlexxx"}; | ||
std::string res[] = {"1", "4", "5", "2", "7"}; | ||
|
||
for (int i = 0; i < sizeof(strs) / sizeof(strs[0]); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not strs.size()?
be/test/exprs/string_fn_test.cpp
Outdated
std::string strs[] = {"Alex test", " Alexx test", "AAAAAlex test", "AAlex Alex", "adssdaAlexxx"}; | ||
std::string res[] = {"1", "4", "5", "2", "7"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if str or pattern are null or empty string?
## Syntax | ||
|
||
```Haskell | ||
VARCHAR regexp_instr(VARCHAR str, VARCHAR pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return int or bigint? what's the range?
|
||
## Description | ||
This function returns the first matching index in the target value which matches the regular expression pattern. The pattern must completely match some parts of str so that the function can return the exactly index. Be careful, returned character index begin at 1. If no matches are found, it will return an 0. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a note how to process null. keep the same behavior with other DB
be/src/exprs/string_functions.cpp
Outdated
if (content_viewer.is_null(row)) { | ||
result.append(0); | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not return null according to null's semantics, what the behavior of other DB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mysql function docs: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-instr
regexp-instr returns the starting index of the substring of the string expr that matches the regular expression specified by the pattern pat, 0 if there is no match. If expr or pat is NULL, the return value is NULL. Character indexes begin at 1.
When row_data equals "", any pattern can't match row_data, so the function will return 0 instead null's semantics.
be/src/exprs/string_functions.cpp
Outdated
if (state->const_pattern) { | ||
return regexp_instr_const(state, columns); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if strings are const?
@@ -300,6 +300,8 @@ | |||
'StringFunctions::regexp_extract_prepare', 'StringFunctions::regexp_close'], | |||
[30321, 'regexp_extract_all', 'ARRAY_VARCHAR', ['VARCHAR', 'VARCHAR', 'BIGINT'], 'StringFunctions::regexp_extract_all', | |||
'StringFunctions::regexp_extract_prepare', 'StringFunctions::regexp_close'], | |||
[30322, 'regexp_instr', 'INT', ['VARCHAR', 'VARCHAR'], 'StringFunctions::regexp_instr', | |||
'StringFunctions::regexp_instr_prepare', 'StringFunctions::regexp_close'], | |||
[30330, 'regexp_replace', 'VARCHAR', ['VARCHAR', 'VARCHAR', 'VARCHAR'], 'StringFunctions::regexp_replace', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-range the functions in the ascend order of function ID(s)
be/test/exprs/string_fn_test.cpp
Outdated
@@ -3546,6 +3546,217 @@ PARALLEL_TEST(VecStringFunctionsTest, regexpExtractAllConst) { | |||
} | |||
} | |||
|
|||
PARALLEL_TEST(VecStringFunctionsTest, regexpInstrPattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some tests on real utf8 string?
context->set_error(error.str().c_str()); | ||
return Status::InvalidArgument(error.str()); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take into consideration that the zeroth argument is also constant. for an example
select regex_instr('abcdef', 'ab[cd]*') from a_very_large_table
or implement constant folding of this function is ScalarOperatorFunctions.java
be/src/exprs/string_functions.cpp
Outdated
auto size = columns[0]->size(); | ||
ColumnBuilder<TYPE_INT> result(size); | ||
for (int row = 0; row < size; ++row) { | ||
if (pattern_str.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when pattern_str.empty() is true, can we just return a constant result column instead of processing each row inside loop.
be/src/exprs/string_functions.cpp
Outdated
std::string row_str = row_value.get_data(); | ||
std::string match_str = match.data(); | ||
|
||
auto index = row_str.find(match_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to use instr function instead of find, the former outperforms the the latter.
displayed_sidebar: "Chinese" | ||
--- | ||
|
||
# regexp_extract |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function name is mismatched, need to be fixed
## 语法 | ||
|
||
```Haskell | ||
regexp_extract(str, pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
…gexp_instr Conflicts: be/src/exprs/string_functions.cpp
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: