-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12716: [C++] Add string padding kernel #10586
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a small python test to just ensure the cython options class is working correctly?
Done (also set the default pad character to ' ' to be nicer to Python users). |
@@ -93,19 +93,23 @@ struct BinaryLength { | |||
} | |||
}; | |||
|
|||
template <typename OffsetValue> | |||
OffsetValue CountCodepoints(const uint8_t* str, size_t strlen) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to template this on OffsetValue
, just return int64_t
.
Also, since this is a generally useful function, perhaps put it in arrow/util/utf8.h
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved & added a small test case.
|
||
int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits, | ||
uint8_t* output) { | ||
// XXX input_string_ncodeunits is misleading - it's just bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"codeunits" is the same as bytes for UTF8 ("codeunits" as opposed to "codepoints").
But, yes, it's pointlessly pedantic. Perhaps a wholesale renaming would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh, somehow I had thought a codeunit was an encoded codepoint (i.e. 1-4 bytes). I'll remove the comment at least so as not to be misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @lidavidm
No description provided.