
minimal protection against invalid UTF8 encoding. #306

Merged
merged 2 commits into from
Aug 16, 2023


@rdentato rdentato commented Aug 16, 2023

Invalid UTF-8 encodings are accepted:

  • multiple representations of the same character (overlong encodings) are accepted
  • stray non-UTF-8 bytes are accepted

The minimal protection I've added ensures that the code does not write past the end of the str_buffer memory when the number of stray continuation bytes is greater than max_token_length*2+1.
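To illustrate the kind of guard described above, here is a minimal sketch, not the code from this patch. It assumes bytes are collected into a small fixed buffer and that a group is flushed after at most 4 bytes (the longest valid UTF-8 sequence). The names `GROUP_MAX`, `is_continuation` and `count_groups` are hypothetical and exist only for this example.

```c
#include <stddef.h>

#define GROUP_MAX 4  /* assumption: flush after 4 bytes, the longest UTF-8 sequence */

/* Returns 1 if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: walks `text`, collects each lead byte together with
 * its continuation bytes into a fixed buffer, and returns the number of
 * groups flushed. The length of the longest group is stored in *longest.
 * No validation is done; the only goal is to never overrun `buf`. */
static size_t count_groups(const char *text, size_t *longest) {
    char buf[GROUP_MAX + 1];  /* +1 for the null terminator */
    size_t len = 0, groups = 0;
    *longest = 0;

    for (const unsigned char *c = (const unsigned char *)text; *c != '\0'; c++) {
        /* an ASCII or lead byte starts a new group */
        if (!is_continuation(*c)) {
            len = 0;
        }

        buf[len++] = (char)*c;
        buf[len] = '\0';

        /* keep appending continuation bytes, but only while there is room */
        if (is_continuation(c[1]) && len < GROUP_MAX) {
            continue;
        }

        /* flush the group; a real encoder would look buf up in the vocabulary */
        groups++;
        if (len > *longest) {
            *longest = len;
        }
        len = 0;  /* protects against long runs of stray continuation bytes */
    }
    return groups;
}
```

With this sketch, an input of ten stray `0x80` bytes is split into three groups of at most 4 bytes each instead of growing without bound, while a well-formed two-byte character such as `"\xC3\xA9"` still comes out as a single group.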


rdentato commented Aug 16, 2023

I've also added some extra space to str_buffer in case max_token_length is 1. That space was needed anyway, so that str_buffer can be used for collecting the bytes of a UTF-8 encoding and not only for handling pairs of tokens.
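A small sketch of why the extra space matters, under stated assumptions: the buffer is sized for two concatenated tokens plus a terminator, and the padding is taken to be 2 bytes so that a 4-byte UTF-8 sequence plus terminator fits even when max_token_length is 1. The helper name `str_buffer_size` and the constant `UTF8_MAX_BYTES` are hypothetical, and the exact padding is an assumption of this example.

```c
#include <stddef.h>

#define UTF8_MAX_BYTES 4  /* longest valid UTF-8 sequence */

/* Hypothetical helper: bytes to allocate for str_buffer.
 * *2 for concatenating two tokens, +1 for the null terminator,
 * +2 (assumed padding) so a 4-byte UTF-8 sequence fits when
 * max_token_length is 1. */
static size_t str_buffer_size(size_t max_token_length) {
    return max_token_length * 2 + 1 + 2;
}
```

Without the padding, max_token_length == 1 would give 3 bytes, too few for a 4-byte sequence plus its terminator; with it, the size is 5.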

@karpathy karpathy merged commit 57bf0e9 into karpathy:master Aug 16, 2023
@karpathy
Owner

I like it ty

@rdentato
Contributor Author

@karpathy, I'll keep the validation branch in my repo and close PR #300.
Should you feel the need for full validation later, I'll resume it.

@rdentato rdentato deleted the patch-utf8-no-validation branch August 16, 2023 17:09
vinhtran2611 pushed a commit to vinhtran2611/llama2.c that referenced this pull request Jan 20, 2024
minimal protection against invalid UTF8 encoding.