Skip to content

Conversation

@edponce
Copy link
Contributor

@edponce edponce commented Mar 24, 2021

This PR adds the utf8_length compute kernel to the string scalar functions to support calculating the string length (as number of characters) for UTF-8 encoded STRINGs and LARGE STRINGs. The implementation makes use of utf8proc (utf8proc_iterate) to perform the calculation.

@edponce edponce marked this pull request as draft March 24, 2021 00:38
@edponce edponce changed the title ARROW 11693: [C++] dd string length kernel ARROW-11693: [C++] Add string length kernel Mar 24, 2021
@edponce edponce marked this pull request as ready for review March 24, 2021 00:44
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@github-actions
Copy link

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @edponce for this PR. Here are a couple comments and questions.

// Single UTF8 character straddles two entries
auto st3 = ValidateFull(2, {0, 1, 2}, "\xc3\xa9");
// Null characters in the string
auto st4 = ValidateFull(1, {0, 4}, "\0\0\0\0");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this is invalid? Unicode character 0 is a valid unicode character.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it works using PyArrow:

>>> arr = pa.array(["\0"])
>>> arr
<pyarrow.lib.StringArray object at 0x7f6befb354b0>
[
  ""
]
>>> arr.to_pylist()
['\x00']
>>> arr.validate(full=True)
>>> 

Copy link
Contributor Author

@edponce edponce Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this case is valid. The UTF-8 standard (https://tools.ietf.org/html/rfc3629) states that characters consisting of 1-byte range from 0-7F (Section 3). But a multibyte UTF-8 character will not contain NUL as a continuation byte, the encoding does not allows it.
Valid: '\x00'
Invalid: '\xe10080'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, note that UTF8 validation is tested more thoroughly in https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/utf8_util_test.cc

@bkietz bkietz self-requested a review March 24, 2021 14:11
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Here are two more comments.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @edponce . I will merge this once CI passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants