The issue here is that .NET strings are UTF-16; i.e., every `char` represents a 16-bit code unit [2]. By contrast, UTF-8 uses variable byte lengths: 1 byte when that's enough (code points up to 0x7F); 2 or more bytes for higher code points.

Given a string like "Aö你", the character count differs from the byte count: 3 `char`s, but 6 bytes in UTF-8 (1 + 2 + 3).
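As a rough illustration of that mismatch (not code from this patch), the sketch below compares `string.Length` with `Encoding.UTF8.GetByteCount` for the same text. The class and method names are hypothetical, and the string literal uses escapes so the result does not depend on the source file's encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the patch.
public static class Utf8CountDemo
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CharCount(string text) => text.Length;

    // Number of bytes the same text occupies once encoded as UTF-8.
    public static int Utf8ByteCount(string text) => Encoding.UTF8.GetByteCount(text);

    public static void Main()
    {
        // "Aö你" written with escapes: U+0041, U+00F6, U+4F60.
        string text = "A\u00F6\u4F60";

        Console.WriteLine($"chars: {CharCount(text)}");      // chars: 3
        Console.WriteLine($"bytes: {Utf8ByteCount(text)}");  // bytes: 6
    }
}
```

Any position computed from `char` indices will therefore fall short of the matching byte position as soon as the text contains a non-ASCII character.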
When we iterate this string one `char` at a time, each offset moves 16 bits forward. Given a UTF-8 string, we need to iterate in 8-bit segments, or we'll miss characters.

To do that, we can iterate by bytes, which this patch implements.
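A minimal sketch of what iterating by bytes can look like, assuming the text is available as a UTF-8 byte array. This is not the patch's actual code, and `Utf8WalkDemo` and `CodePointStarts` are hypothetical names. It relies only on the UTF-8 rule that continuation bytes have the bit pattern `10xxxxxx`, so every other byte begins a new code point.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of the patch.
public static class Utf8WalkDemo
{
    // Walks the buffer one byte at a time and records the byte offset
    // at which each code point starts.
    public static List<int> CodePointStarts(byte[] utf8)
    {
        var starts = new List<int>();
        for (int i = 0; i < utf8.Length; i++)
        {
            // Continuation bytes look like 10xxxxxx; any other byte is a lead byte.
            if ((utf8[i] & 0xC0) != 0x80)
            {
                starts.Add(i);
            }
        }
        return starts;
    }

    public static void Main()
    {
        // "Aö你" encodes to 41 | C3 B6 | E4 BD A0.
        byte[] bytes = Encoding.UTF8.GetBytes("A\u00F6\u4F60");
        Console.WriteLine(string.Join(", ", CodePointStarts(bytes)));  // 0, 1, 3
    }
}
```

The `char` indices for the same string would be 0, 1, 2, which is why offsets taken from a .NET string cannot be used directly as positions in a UTF-8 buffer.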
Fixes https://community.notepad-plus-plus.org/topic/23471/custom-lexer-and-unicode-utf-8-text-file-content
Footnotes

1. https://docs.microsoft.com/en-us/answers/questions/587680/where-can-i-find-34beta-use-unicode-utf-8-for-worl.html
2. https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction#the-string-and-char-types