Modifies the chr() check to properly handle \u0000 codes in the range [127, 255] in the lexer #554

vladar merged 2 commits into webonyx:master from jakevoytko:jakevoytko-utf8-lexing-fix
Conversation
This caused the lexer to output invalid UTF-8 for input like 'pok\u00E9mon'. The output for the é would be the raw decimal byte 233, whose bit pattern marks it as the lead byte of a 3-byte UTF-8 sequence, but the following bytes didn't have the leading 10 prefix, so the sequence was invalid UTF-8.
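To make that concrete, here is a small standalone snippet (not code from this PR) showing that the lone byte 233 is rejected as UTF-8:

<?php
// PHP's chr() emits the single raw byte 0xE9 for the code point of 'é'.
$raw = chr(0xE9);

// 0xE9 is 1110 1001 in binary: the lead byte of a 3-byte UTF-8 sequence.
// With no continuation bytes after it, the string is invalid UTF-8.
var_dump(mb_check_encoding($raw, 'UTF-8')); // bool(false)
echo bin2hex($raw), "\n";                   // "e9"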
I think an additional modification is needed to make it work with strings like '𝕄𝕖𝕥𝕒𝕝𝕝𝕚𝕔' (encoding …
src/Utils/Utils.php
Outdated
  public static function chr($ord, $encoding = 'UTF-8')
  {
-     if ($ord <= 255) {
+     if ($ord <= 127) {
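For context, a minimal sketch of the patched check; chrUtf8 is a hypothetical name and the conversion details are an assumption, not the library's exact Utils::chr code:

<?php
// Hedged sketch, not the actual Utils::chr() implementation.
function chrUtf8(int $ord, string $encoding = 'UTF-8'): string
{
    if ($ord <= 127) {
        // ASCII code points are single bytes in UTF-8, so chr() is safe here.
        return chr($ord);
    }
    // Anything above 127 needs a real conversion to become valid UTF-8:
    // build the code point as fixed-width UCS-4BE, then convert.
    return mb_convert_encoding(pack('N', $ord), $encoding, 'UCS-4BE');
}

echo bin2hex(chrUtf8(0xE9)), "\n"; // "c3a9" — the valid two-byte UTF-8 encoding of 'é'

With the old <= 255 check, 0xE9 would have taken the chr() branch and produced the invalid lone byte instead.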
I guess this check only makes sense when $encoding is UTF-8. But we can even try deleting this check altogether since it was a performance optimization that is not required anymore as far as I remember.
I pushed another commit removing the check entirely per your feedback.
The problem I mentioned with characters like "𝕄𝕖𝕥𝕒𝕝𝕝𝕚𝕔" is a separate issue (UTF-16 surrogates aren't handled in the lexer). I'll mail a patch for that early next week.
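For reference, characters like '𝕄' live outside the Basic Multilingual Plane and appear in string literals as two \u escapes forming a UTF-16 surrogate pair. A hedged sketch of how such a pair combines into one code point (hypothetical helpers, not part of this PR):

<?php
// Hypothetical helpers, not part of this PR.
function surrogatePairToCodePoint(int $high, int $low): int
{
    // Standard UTF-16 decoding: 0x10000 plus 10 bits from each surrogate.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

function codePointToUtf8(int $ord): string
{
    return mb_convert_encoding(pack('N', $ord), 'UTF-8', 'UCS-4BE');
}

// '𝕄' (U+1D544) is written as "\uD835\uDD44" in a string literal.
$cp = surrogatePairToCodePoint(0xD835, 0xDD44);
echo dechex($cp), "\n";                   // "1d544"
echo bin2hex(codePointToUtf8($cp)), "\n"; // "f09d9584" — a valid 4-byte UTF-8 sequence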