Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow common process_escapes to handle \x sequences #3928

Merged
merged 2 commits into from
Nov 5, 2023

Conversation

KerfuffleV2
Copy link
Collaborator

For example, you can do something like \xf0\x9f\x98\x82 to enter 😂 with these changes.

Comment on lines 96 to 99
const char x[3] = { input[input_idx + 1], input[input_idx + 2], 0 };
char *err_p = nullptr;
const long val = std::strtol(x, &err_p, 16);
if (*err_p == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is C++, why not something like this?

Suggested change
const char x[3] = { input[input_idx + 1], input[input_idx + 2], 0 };
char *err_p = nullptr;
const long val = std::strtol(x, &err_p, 16);
if (*err_p == 0) {
std::string x(input + input_idx + 1, 2);
size_t pos = 0;
const long val = std::stol(x, &pos, 16);
if (pos == x.length()) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the first reason is I didn't know to write it that way. :) I think there's a reason to keep it as-is though, creating a const 3 character array probably has lower overhead than making and destroying a std::string, concatenating some other strings, etc. It's probably a small difference, but hey, when I'm feeding a 100,000 token prompt to a 128K YaRN model maybe it'll be .5sec faster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of small-string optimization there won't be any heap allocation, but I suppose std::string is still slightly more expensive.

If we keep the C version, we should replace const char x[3] with just const char x[] since the size is inferred.

C++ has had std::string_view and std::from_chars since 2017, but we can't use them :/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized the previous code didn't correctly handle a case where the second hex digit was \0. This version is a little simpler and clearer also. However, I kept the [3] since I feel better about explicitly setting the size when doing pointer arithmetic on the array like that.

@KerfuffleV2 KerfuffleV2 merged commit d9ccce2 into ggerganov:master Nov 5, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
* Allow common process_escapes to handle \x sequences

* Fix edge case when second hex digit is NUL
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants