Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modifies the chr() check to properly handle \u0000 codes in the range… #554

Merged
merged 2 commits into from
Oct 15, 2019
Merged

Modifies the chr() check to properly handle \u0000 codes in the range… #554

merged 2 commits into from
Oct 15, 2019

Conversation

jakevoytko
Copy link
Contributor

… [127, 255] in the lexer

This caused the lexer to output invalid UTF-8 for input like 'pok\u00E9mon'. The output for the é
would be the decimal byte 233, which would indicate that it should be a 4 byte unicode sequence,
but the following bytes didn't have the leading 10 prefix, so the sequence was invalid UTF-8.

… [127, 255] in the lexer

This caused the lexer to output invalid UTF-8 for input like 'pok\u00E9mon'. The output for the é
would be the decimal byte 233, which would indicate that it should be a 4 byte unicode sequence,
but the following bytes didn't have the leading 10 prefix, so the sequence was invalid UTF-8.
@jakevoytko
Copy link
Contributor Author

I think additional modification is needed to make it worth with strings like '𝕄𝕖𝕥𝕒𝕝𝕝𝕚𝕔' (encoding '"\ud835\udd44\ud835\udd56\ud835\udd65\ud835\udd52\ud835\udd5d\ud835\udd5d\ud835\udd5a\ud835\udd54"'). Consider holding off on reviewing it

@@ -442,7 +442,7 @@ public static function printSafe($var)
*/
public static function chr($ord, $encoding = 'UTF-8')
{
if ($ord <= 255) {
if ($ord <= 127) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this check only makes sense when $encoding is UTF-8. But we can even try deleting this check altogether since it was a performance optimization that is not required anymore as far as I remember.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed another commit removing the check entirely per your feedback.

The problem I mentioned about characters like "𝕄𝕖𝕥𝕒𝕝𝕝𝕚𝕔" are a separate problem (UTF-16 surrogates aren't handled in the lexer). I'll mail a patch for that early next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants