-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-15644: [C++][Gandiva] Implement Find_In_Set Function #12391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15644: [C++][Gandiva] Implement Find_In_Set Function #12391
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
01b0973 to
b711271
Compare
|
|
3d6fb68 to
ed5c2c8
Compare
86cea33 to
e70dca9
Compare
c095449 to
a7a4753
Compare
a7a4753 to
66c25c7
Compare
66c25c7 to
b7c3316
Compare
86c145c to
fcbd2ef
Compare
b6970fe to
afba45c
Compare
e978d81 to
502072f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not correct..if there are empty strings then result can be positive no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@projjal
I followed the Hive results, should I return false validity?
I found one error in my return...
If find_len && list_len == 0 the return is 1.
74ef20a to
e4fc0bf
Compare
e4fc0bf to
d923d11
Compare
095f1e0 to
92c8bd6
Compare
|
Hmm... Arrow has a proper List type, so why take a comma-delimited string? |
| int32_t find_in_set_utf8_utf8(int64_t context, const char* to_find, int32_t to_find_len, | ||
| const char* string_list, int32_t string_list_len) { | ||
| // Return 0 if to search entry have commas | ||
| if (is_substr_utf8_utf8(to_find, to_find_len, reinterpret_cast<const char*>(","), 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are looking for a single unicode codepoint below 128, you can probably do this faster using memchr.
| cur_length = 0; | ||
| } | ||
| } else { | ||
| if (cur_length + 1 <= string_list_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this condition? In which situation is it false?
| } | ||
| } else { | ||
| if (cur_length + 1 <= string_list_len) { | ||
| if (!matching || (memcmp(string_list + i, to_find + cur_length, 1))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call memcmp if you are only comparing a single byte?
@pitrou there are two things:
In the future, after adding the support for complex types we can add support for it using lists |
Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas.
For example, find_in_set('ab', 'abc,b,ab,c,def')
returns 3