Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] support regex_instr #36958

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

liuzhixing1006
Copy link

@liuzhixing1006 liuzhixing1006 commented Dec 13, 2023

Why I'm doing:

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@liuzhixing1006 liuzhixing1006 requested a review from a team as a code owner December 13, 2023 08:32
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 13, 2023
@liuzhixing1006 liuzhixing1006 changed the title [Feature] support regex_instr #19616 [Feature] support regex_instr Dec 13, 2023
Conflicts:
	be/test/exprs/string_fn_test.cpp
@liuzhixing1006
Copy link
Author

This resolves #19616

Comment on lines 33 to 36
MySQL > SELECT regexp_instr('abCdE', 'ABC');
+--------------------------------------------------------+
| regexp_instr('abCdE', 'AbC'); |
+--------------------------------------------------------+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change 'ABC' to 'AbC' ?

std::string strs[] = {"Alex test", " Alexx test", "AAAAAlex test", "AAlex Alex", "adssdaAlexxx"};
std::string res[] = {"1", "4", "5", "2", "7"};

for (int i = 0; i < sizeof(strs) / sizeof(strs[0]); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not strs.size()?

Comment on lines 3558 to 3559
std::string strs[] = {"Alex test", " Alexx test", "AAAAAlex test", "AAlex Alex", "adssdaAlexxx"};
std::string res[] = {"1", "4", "5", "2", "7"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if str or pattern are null or empty string?

## Syntax

```Haskell
VARCHAR regexp_instr(VARCHAR str, VARCHAR pattern)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return int or bigint? what's the range?


## Description
This function returns the first matching index in the target value which matches the regular expression pattern. The pattern must completely match some parts of str so that the function can return the exactly index. Be careful, returned character index begin at 1. If no matches are found, it will return an 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a note how to process null. keep the same behavior with other DB

Comment on lines 3552 to 3554
if (content_viewer.is_null(row)) {
result.append(0);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not return null according to null's semantics, what the behavior of other DB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mysql function docs: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-instr

regexp-instr returns the starting index of the substring of the string expr that matches the regular expression specified by the pattern pat, 0 if there is no match. If expr or pat is NULL, the return value is NULL. Character indexes begin at 1.

When row_data equals "", any pattern can't match row_data, so the function will return 0 instead null's semantics.

Comment on lines 3640 to 3642
if (state->const_pattern) {
return regexp_instr_const(state, columns);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if strings are const?

@@ -300,6 +300,8 @@
'StringFunctions::regexp_extract_prepare', 'StringFunctions::regexp_close'],
[30321, 'regexp_extract_all', 'ARRAY_VARCHAR', ['VARCHAR', 'VARCHAR', 'BIGINT'], 'StringFunctions::regexp_extract_all',
'StringFunctions::regexp_extract_prepare', 'StringFunctions::regexp_close'],
[30322, 'regexp_instr', 'INT', ['VARCHAR', 'VARCHAR'], 'StringFunctions::regexp_instr',
'StringFunctions::regexp_instr_prepare', 'StringFunctions::regexp_close'],
[30330, 'regexp_replace', 'VARCHAR', ['VARCHAR', 'VARCHAR', 'VARCHAR'], 'StringFunctions::regexp_replace',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-range the functions in the ascend order of function ID(s)

@@ -3546,6 +3546,217 @@ PARALLEL_TEST(VecStringFunctionsTest, regexpExtractAllConst) {
}
}

PARALLEL_TEST(VecStringFunctionsTest, regexpInstrPattern) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some tests on real utf8 string?

context->set_error(error.str().c_str());
return Status::InvalidArgument(error.str());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take into consideration that the zeroth argument is also constant. for an example

select regex_instr('abcdef', 'ab[cd]*') from a_very_large_table

or implement constant folding of this function is ScalarOperatorFunctions.java

auto size = columns[0]->size();
ColumnBuilder<TYPE_INT> result(size);
for (int row = 0; row < size; ++row) {
if (pattern_str.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when pattern_str.empty() is true, can we just return a constant result column instead of processing each row inside loop.

std::string row_str = row_value.get_data();
std::string match_str = match.data();

auto index = row_str.find(match_str);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to use instr function instead of find, the former outperforms the the latter.

displayed_sidebar: "Chinese"
---

# regexp_extract
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function name is mismatched, need to be fixed

## 语法

```Haskell
regexp_extract(str, pattern)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@DanRoscigno DanRoscigno removed the documentation Improvements or additions to documentation label Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants