ARROW-11693: [C++] Add string length kernel #9786

edponce · 2021-03-24T00:34:36Z

This PR adds the utf8_length compute kernel to the string scalar functions to support calculating the string length (as number of characters) for UTF-8 encoded STRINGs and LARGE STRINGs. The implementation makes use of utf8proc (utf8proc_iterate) to perform the calculation.
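For readers unfamiliar with the distinction between byte length and character length, the following is a minimal sketch of what "length as number of characters" means for UTF-8 data. It does not use utf8proc and is not the code in this PR: `CountUtf8CodePoints` is a hypothetical helper name, and the sketch assumes the input is already valid UTF-8. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which equals the number of code points.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of Arrow's API.
// Counts Unicode code points in a UTF-8 string by counting every byte that
// is NOT a continuation byte. Continuation bytes look like 10xxxxxx, so each
// code point contributes exactly one non-continuation (lead) byte.
// Assumption: the input is already valid UTF-8.
inline std::int64_t CountUtf8CodePoints(const std::string& s) {
  std::int64_t count = 0;
  for (unsigned char byte : s) {
    if ((byte & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}
```

Under these assumptions, the two-byte sequence "\xc3\xa9" (the character é) counts as one character, whereas a byte-length function such as binary_length would report 2.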

github-actions · 2021-03-24T01:24:49Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

github-actions · 2021-03-24T01:52:08Z

https://issues.apache.org/jira/browse/ARROW-11693

pitrou

Thank you @edponce for this PR. Here are a couple comments and questions.

pitrou · 2021-03-24T12:29:10Z

cpp/src/arrow/array/array_binary_test.cc

    // Single UTF8 character straddles two entries
    auto st3 = ValidateFull(2, {0, 1, 2}, "\xc3\xa9");
+    // Null characters in the string
+    auto st4 = ValidateFull(1, {0, 4}, "\0\0\0\0");


Can you explain why this is invalid? Unicode character 0 is a valid Unicode character.

Also, it works using PyArrow:

    >>> arr = pa.array(["\0"])
    >>> arr
    <pyarrow.lib.StringArray object at 0x7f6befb354b0>
    [
      ""
    ]
    >>> arr.to_pylist()
    ['\x00']
    >>> arr.validate(full=True)
    >>>

You are right, this case is valid. The UTF-8 standard (https://tools.ietf.org/html/rfc3629) states that single-byte characters range from 0x00 to 0x7F (Section 3), so NUL on its own is a valid character. However, a multibyte UTF-8 character cannot contain NUL as a continuation byte; the encoding does not allow it.
Valid: '\x00'
Invalid: '\xe1\x00\x80'

Also, note that UTF8 validation is tested more thoroughly in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/utf8_util_test.cc
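The rule discussed in this thread (NUL is a valid single-byte character but never a valid continuation byte) can be illustrated with a rough sketch. This is not Arrow's validator: `HasValidUtf8Structure` is a hypothetical helper that only checks the lead-byte and continuation-byte structure, and it deliberately ignores overlong encodings, surrogates, and the U+10FFFF upper bound.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not Arrow's UTF-8 validator.
// Simplified structural check: each lead byte must be followed by the right
// number of continuation bytes of the form 10xxxxxx. It does NOT reject
// overlong encodings, surrogates, or code points above U+10FFFF.
inline bool HasValidUtf8Structure(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80) {
      extra = 0;  // 0xxxxxxx: single byte, 0x00 (NUL) included
    } else if ((lead & 0xE0) == 0xC0) {
      extra = 1;  // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) {
      extra = 2;  // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) {
      extra = 3;  // 11110xxx
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (s.size() - i - 1 < extra) {
      return false;  // sequence truncated at end of input
    }
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) {
        return false;  // e.g. 0x00 where a continuation byte is required
      }
    }
    i += extra + 1;
  }
  return true;
}
```

With this sketch, a string holding the single byte 0x00 passes, while the three bytes E1 00 80 fail because 0x00 does not match the 10xxxxxx continuation pattern. Strings with embedded NULs need an explicit length when constructed, for example `std::string("\xe1\x00\x80", 3)`.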

cpp/src/arrow/compute/kernels/scalar_string.cc

pitrou

Thanks for the updates! Here are two more comments.

cpp/src/arrow/compute/kernels/scalar_string.cc

cpp/src/arrow/compute/kernels/scalar_string_test.cc

pitrou

Thank you very much @edponce . I will merge this once CI passes.

edponce added 4 commits March 23, 2021 18:39

Add utf8_length documentation

bbb32e7

add a UTF8 case to binary_length, add an equivalent test for utf8_length

8630d24

add a test ensuring "\0" is invalid utf8

05d6bde

add utf8_length kernels for STRING and LARGE_STRING

6379ff4

edponce marked this pull request as draft March 24, 2021 00:38

edponce changed the title ~~ARROW 11693: [C++] dd string length kernel~~ ARROW-11693: [C++] Add string length kernel Mar 24, 2021

edponce marked this pull request as ready for review March 24, 2021 00:44

github-actions bot added the Component: C++ label Mar 24, 2021

pitrou reviewed Mar 24, 2021

bkietz self-requested a review March 24, 2021 14:11

edponce added 2 commits March 24, 2021 11:54

add optimized implementation for UTF8 strlen (does not depends on utf8_proc)

0c81029

fix UTF8 tests, NUL is a valid character

16deddf

pitrou reviewed Mar 24, 2021

cpp/src/arrow/compute/kernels/scalar_string.cc

cpp/src/arrow/compute/kernels/scalar_string_test.cc

edponce added 2 commits March 24, 2021 12:52

add tests for utf8_length (emoji and mixed ASCII/UTF8)

8e8f37b

fix format of C++ files

46a6b7a

pitrou approved these changes Mar 24, 2021

bkietz approved these changes Mar 24, 2021

pitrou closed this in 9ad2562 Mar 24, 2021

asfimport mentioned this pull request Apr 12, 2021

[C++] Add string length kernel #27555

Closed
