-
Couldn't load subscription status.
- Fork 3.9k
ARROW-11693: [C++] Add string length kernel #9786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11693: [C++] Add string length kernel #9786
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @edponce for this PR. Here are a couple comments and questions.
| // Single UTF8 character straddles two entries | ||
| auto st3 = ValidateFull(2, {0, 1, 2}, "\xc3\xa9"); | ||
| // Null characters in the string | ||
| auto st4 = ValidateFull(1, {0, 4}, "\0\0\0\0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this is invalid? Unicode character 0 is a valid unicode character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it works using PyArrow:
>>> arr = pa.array(["\0"])
>>> arr
<pyarrow.lib.StringArray object at 0x7f6befb354b0>
[
""
]
>>> arr.to_pylist()
['\x00']
>>> arr.validate(full=True)
>>> There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this case is valid. The UTF-8 standard (https://tools.ietf.org/html/rfc3629) states that characters consisting of 1-byte range from 0-7F (Section 3). But a multibyte UTF-8 character will not contain NUL as a continuation byte, the encoding does not allows it.
Valid: '\x00'
Invalid: '\xe10080'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, note that UTF8 validation is tested more thoroughly in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/utf8_util_test.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Here are two more comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @edponce . I will merge this once CI passes.
This PR adds the utf8_length compute kernel to the string scalar functions to support calculating the string length (as number of characters) for UTF-8 encoded STRINGs and LARGE STRINGs. The implementation makes use of utf8proc (utf8proc_iterate) to perform the calculation.