Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12716: [C++] Add string padding kernel #10586

Closed
wants to merge 4 commits into from

Conversation

lidavidm
Copy link
Member

No description provided.

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small python test to just ensure the cython options class is working correctly?

@lidavidm
Copy link
Member Author

Done (also set the default pad character to ' ' to be nicer to Python users).

@@ -93,19 +93,23 @@ struct BinaryLength {
}
};

template <typename OffsetValue>
OffsetValue CountCodepoints(const uint8_t* str, size_t strlen) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to template this on OffsetValue, just return int64_t.
Also, since this is a generally useful function, perhaps put it in arrow/util/utf8.h?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved & added a small test case.


int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
uint8_t* output) {
// XXX input_string_ncodeunits is misleading - it's just bytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"codeunits" is the same as bytes for UTF8 ("codeunits" as opposed to "codepoints").
But, yes, it's pointlessly pedantic. Perhaps a wholesale renaming would be useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, somehow I had thought a codeunit was an encoded codepoint (i.e. 1-4 bytes). I'll remove the comment at least so as not to be misleading.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @lidavidm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants