Skip to content

Commit

Permalink
Fix: error decoding some Korean Hangul graphemes (Mudlet#7431)
Browse files Browse the repository at this point in the history
#### Brief overview of PR changes/additions
Extends the testing for a range of sequences of UTF-8 encoded bytes to
be "tighter" - so that a smaller (half the previous) range of
code-points were rejected. This range should now only include those that
would indicate UTF-16BE surrogate code points (UTF-16, like UTF-8, is a
variable length encoding and that range is reserved for conveying the
code-points that need a pair of UTF-16 values) and NOT include some of
the range (inside U+D000 to U+D7FF) that is taken by graphemes
particularly used for Korean.

#### Motivation for adding to Mudlet
A Korean user noticed that some graphemes in the text for the MUD they
were on were not being displayed correctly by Mudlet as they were being
replaced by the "Replacement Character" but a screen shot from a
different client showed proper Korean (Hangul) ones. Testing revealed
that a range of code-points were being rejected as being High or Low
Surrogates when they were not.

#### Other info (issues closed, discussion etc)
This should close Mudlet#7429

Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
Co-authored-by: Vadim Peretokin <vperetokin@hey.com>
  • Loading branch information
SlySven and vadi2 authored Sep 11, 2024
1 parent c0b12bc commit 2096535
Showing 1 changed file with 16 additions and 4 deletions.
20 changes: 16 additions & 4 deletions src/TBuffer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3553,6 +3553,7 @@ bool TBuffer::processUtf8Sequence(const std::string& bufferData, const bool isFr
bool isToUseByteOrderMark = false; // When BOM seen in stream it transcodes as zero characters
switch (utf8SequenceLength) {
case 4:
// Check the 4th byte is a valid continuation byte (2 MS-Bits are 10)
if ((bufferData.at(pos + 3) & 0xC0) != 0x80) {
#if defined(DEBUG_UTF8_PROCESSING)
qDebug() << "TBuffer::processUtf8Sequence(...) 4th byte in UTF-8 sequence is invalid!";
Expand Down Expand Up @@ -3584,13 +3585,14 @@ bool TBuffer::processUtf8Sequence(const std::string& bufferData, const bool isFr
// Fall-through
[[fallthrough]];
case 3:
// Check the 3rd byte is a valid continuation byte (2 MS-Bits are 10)
if ((bufferData.at(pos + 2) & 0xC0) != 0x80) {
#if defined(DEBUG_UTF8_PROCESSING)
qDebug() << "TBuffer::processUtf8Sequence(...) 3rd byte in UTF-8 sequence is invalid!";
#endif
isValid = false;
isToUseReplacementMark = true;
} else if ((bufferData.at(pos) & 0x0F) == 0x0D) {
} else if ((bufferData.at(pos) & 0x0F) == 0x0D && (bufferData.at(pos + 1) & 0x20) == 0x20) {
// For 3 byte values the bits are distributed:
// Byte 1 Byte 2 Byte 3
// 1110ABCD 10DEFGHI 10JKLMNO A is MSB
Expand All @@ -3617,8 +3619,15 @@ bool TBuffer::processUtf8Sequence(const std::string& bufferData, const bool isFr
* followed by a low surrogate). The majority of UTF-16 encoder and decoder
* implementations translate between encodings as though this were the case
* and Windows allows such sequences in filenames."
*
* So test for and, considering the LS Nibble of first byte:
* * accept if LS Nibble of first byte is less than 0xD
* * accept if LS Nibble of first byte is greater than/equal to 0xE
* * otherwise (if LS Nibble of first byte IS 0xD)
* * accept if 6 LS Bits of second byte is 0x1F of or less
* Conversely this can be stated as:
* * reject if LS Nibble of first byte is 0xD AND 6th MS Bit of second byte is set
*/
// So test for and reject if LSN of first byte is 0xD!
#if defined(DEBUG_UTF8_PROCESSING)
qDebug() << "TBuffer::processUtf8Sequence(...) 3 byte UTF-8 sequence is a High or Low UTF-16 Surrogate and is not valid in UTF-8!";
#endif
Expand All @@ -3627,8 +3636,10 @@ bool TBuffer::processUtf8Sequence(const std::string& bufferData, const bool isFr
} else if ( (static_cast<quint8>(bufferData.at(pos + 2)) == 0xBF)
&& (static_cast<quint8>(bufferData.at(pos + 1)) == 0xBB)
&& (static_cast<quint8>(bufferData.at(pos )) == 0xEF)) {
// Got caught out by this one - it is the UTF-8 BOM and
// needs to be ignored as it transcodes to NO codepoints!

// Got caught out by this one - it is the UTF-8 BOM (or
// Zero-Width No-Break Space) and needs to be detected specially
// as Qt's codec ignores it and transcodes it to NO codepoints!
#if defined(DEBUG_UTF8_PROCESSING)
qDebug() << "TBuffer::processUtf8Sequence(...) UTF-8 BOM sequence seen and handled!";
#endif
Expand All @@ -3639,6 +3650,7 @@ bool TBuffer::processUtf8Sequence(const std::string& bufferData, const bool isFr
// Fall-through
[[fallthrough]];
case 2:
// Check the 2nd byte is a valid continuation byte (2 MS-Bits are 10)
if ((static_cast<quint8>(bufferData.at(pos + 1)) & 0xC0) != 0x80) {
#if defined(DEBUG_UTF8_PROCESSING)
qDebug() << "TBuffer::processUtf8Sequence(...) 2nd byte in UTF-8 sequence is invalid!";
Expand Down

0 comments on commit 2096535

Please sign in to comment.