ARROW-15644: [C++][Gandiva] Implement Find_In_Set Function #12391

ViniciusSouzaRoque · 2022-02-10T11:22:21Z

Returns the 1-based position of the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas.

For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3.
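To make the described semantics concrete, here is a minimal standalone sketch. It is not the Gandiva implementation from this PR: `FindInSetSketch` is a hypothetical name, null handling is left out, and returning 0 when nothing matches is an assumption based on Hive's behaviour rather than something stated in the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper illustrating the described behaviour; not the PR's code.
// Returns the 1-based position of `to_find` among the comma-separated entries
// of `list`. Returns 0 if `to_find` contains a comma.
int32_t FindInSetSketch(std::string_view to_find, std::string_view list) {
  // A needle containing a comma can never equal a single entry.
  if (to_find.find(',') != std::string_view::npos) return 0;

  int32_t position = 1;   // 1-based index of the entry being examined
  std::size_t start = 0;  // byte offset where the current entry begins
  while (true) {
    const std::size_t comma = list.find(',', start);
    const bool last = (comma == std::string_view::npos);
    const std::size_t end = last ? list.size() : comma;
    if (list.substr(start, end - start) == to_find) return position;
    if (last) return 0;  // assumption: 0 means "not found", as in Hive
    start = comma + 1;
    ++position;
  }
}
```

With this sketch, `FindInSetSketch("ab", "abc,b,ab,c,def")` gives 3, matching the example above. Empty entries are compared like any other entry, so an empty needle can match an empty entry in the list.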

github-actions · 2022-02-10T11:22:43Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

github-actions · 2022-02-10T11:29:57Z

https://issues.apache.org/jira/browse/ARROW-15644

github-actions · 2022-02-10T11:29:59Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

projjal · 2022-07-20T05:02:45Z

cpp/src/gandiva/precompiled/string_ops.cc

This is not correct. If there are empty strings, the result can be a positive number.

ViniciusSouzaRoque · 2022-07-20

@projjal
I followed the Hive results. Should I return false validity instead?
I found one error in my return value:
if find_len == 0 && list_len == 0, the return is 1.

cpp/src/gandiva/precompiled/string_ops_test.cc

anthonylouisbsb · 2022-08-03T14:22:41Z

@kou @pitrou Can you check and merge this PR? It has been approved by Projjal and me.

pitrou · 2022-08-03T14:53:24Z

Hmm... Arrow has a proper List type, so why take a comma-delimited string?

pitrou · 2022-08-03T14:55:21Z

cpp/src/gandiva/precompiled/string_ops.cc

+int32_t find_in_set_utf8_utf8(int64_t context, const char* to_find, int32_t to_find_len,
+                              const char* string_list, int32_t string_list_len) {
+  // Return 0 if to search entry have commas
+  if (is_substr_utf8_utf8(to_find, to_find_len, reinterpret_cast<const char*>(","), 1)) {


Since you are looking for a single unicode codepoint below 128, you can probably do this faster using memchr.
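A minimal sketch of that suggestion, assuming a hypothetical helper named `ContainsComma` (not part of the PR). A plain byte scan is safe on UTF-8 input because every byte of a multi-byte sequence is 0x80 or higher, so the byte 0x2C can only ever be an actual comma.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true if the UTF-8 buffer contains an ASCII comma.
// The signature mirrors the (pointer, int32_t length) style of the quoted diff.
bool ContainsComma(const char* data, int32_t len) {
  // Guard against empty or invalid lengths before calling memchr.
  if (data == nullptr || len <= 0) return false;
  return std::memchr(data, ',', static_cast<std::size_t>(len)) != nullptr;
}
```

The check in the quoted diff could then read something like `if (ContainsComma(to_find, to_find_len)) return 0;`, though how it fits into the surrounding function is not shown here.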

pitrou · 2022-08-03T14:59:05Z

cpp/src/gandiva/precompiled/string_ops.cc

+        cur_length = 0;
+      }
+    } else {
+      if (cur_length + 1 <= string_list_len) {


Why this condition? In which situation is it false?

pitrou · 2022-08-03T14:59:25Z

cpp/src/gandiva/precompiled/string_ops.cc

+      }
+    } else {
+      if (cur_length + 1 <= string_list_len) {
+        if (!matching || (memcmp(string_list + i, to_find + cur_length, 1))) {


Why call memcmp if you are only comparing a single byte?

anthonylouisbsb · 2022-08-03T17:47:33Z

> Hmm... Arrow has a proper List type, so why take a comma-delimited string?

@pitrou there are two things:

1. Currently, the Gandiva project does not accept complex types (we need to plan to add them in the future).
2. We took this function from the Apache Hive project, where the signature takes a comma-separated string: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27362046 (search for find_in_set to find the function signature).

In the future, after adding support for complex types, we can add a variant that takes lists.

github-actions bot added Component: Gandiva Component: C++ labels Feb 10, 2022

ViniciusSouzaRoque changed the title ~~[C++][Gandiva] Implement Find_In_Set Function~~ ARROW-15644: [C++][Gandiva] Implement Find_In_Set Function Feb 10, 2022

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from 01b0973 to b711271 Compare February 10, 2022 11:29

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch 3 times, most recently from 3d6fb68 to ed5c2c8 Compare February 10, 2022 13:45

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from 86cea33 to e70dca9 Compare April 18, 2022 09:34

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from c095449 to a7a4753 Compare April 26, 2022 09:52

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from a7a4753 to 66c25c7 Compare May 19, 2022 16:05

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from 66c25c7 to b7c3316 Compare June 7, 2022 10:49

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch 2 times, most recently from 86c145c to fcbd2ef Compare June 27, 2022 10:25

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch 3 times, most recently from b6970fe to afba45c Compare July 1, 2022 10:43

projjal reviewed Jul 4, 2022

cpp/src/gandiva/precompiled/string_ops.cc (outdated)

projjal reviewed Jul 4, 2022

cpp/src/gandiva/precompiled/string_ops.cc (outdated)

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch 4 times, most recently from e978d81 to 502072f Compare July 14, 2022 11:18

projjal reviewed Jul 20, 2022

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch 2 times, most recently from 74ef20a to e4fc0bf Compare July 21, 2022 18:41

PHILO-HE added a commit to PHILO-HE/arrow that referenced this pull request Jul 22, 2022

Port the code from PR apache#12391 of apache arrow

2df3c06

projjal reviewed Jul 22, 2022

cpp/src/gandiva/precompiled/string_ops_test.cc (outdated)

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from e4fc0bf to d923d11 Compare July 22, 2022 11:33

zhouyuan pushed a commit to oap-project/arrow that referenced this pull request Jul 25, 2022

Port the code from PR apache#12391 of apache arrow (#131)

f315732

ViniciusSouzaRoque added 7 commits July 25, 2022 07:08

First implementation function Find in Set

50fd0af

Added UTF8 Support

b4c6692

Change output test name

c760975

Remove invalid return to empty strings

443e9b0

Skip utf8 length check

a0ae3e4

Fix return to empty strings

67e2182

Add requested tests

92c8bd6

ViniciusSouzaRoque force-pushed the feature/add-find-in-set-function branch from 095f1e0 to 92c8bd6 Compare July 25, 2022 10:08

Empty commit to run CI

3dcca02

projjal approved these changes Jul 29, 2022

anthonylouisbsb approved these changes Aug 3, 2022

pitrou reviewed Aug 3, 2022

ViniciusSouzaRoque closed this Sep 9, 2022
