utf8.read function producing wrong strings #1473
Comments
* fix utf8 -> utf16 decoding bug on surrogate pairs This fixes #1473 The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk. It looks like the chunking mechanism was added to prevent stack overflows when calling `formCharCode` with too many args. From some benchmarking it appears putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway. I simplified it significantly. Here's a repro of the existing encoding bug in a fuzzing suite https://repl.it/@turbio/oh-no-our-strings#decoder.js * fix lint * add test case for surrogate pair bug Co-authored-by: Alexander Fenster <fenster@google.com>
This is still not fixed with #1486. We hit a very mysterious problem where an extra � was inserted at roughly every 8,000 characters in some of our messages. It turned out we only saw it when the strings contained emoji. We went the `TextDecoder` route too, and it works every time.
@daviderenger do you have a test case?
Encountered this too; @protobufjs/utf8 is still at v1.1.0, so the patch isn't applied. Any update?

EDIT: for anyone encountering this, apply the following diff:

```diff
diff --git a/index.js b/index.js
index e4ff8df9b7e83854df3ed09eb1cee8253af7c497..9dc6c05970cdf1734dddcea73ce261325f22abcc 100644
--- a/index.js
+++ b/index.js
@@ -38,36 +38,27 @@ utf8.length = function utf8_length(string) {
  * @returns {string} String read
  */
 utf8.read = function utf8_read(buffer, start, end) {
-    var len = end - start;
-    if (len < 1)
+    if (end - start < 1) {
         return "";
-    var parts = null,
-        chunk = [],
-        i = 0, // char offset
-        t; // temporary
-    while (start < end) {
-        t = buffer[start++];
-        if (t < 128)
-            chunk[i++] = t;
-        else if (t > 191 && t < 224)
-            chunk[i++] = (t & 31) << 6 | buffer[start++] & 63;
-        else if (t > 239 && t < 365) {
-            t = ((t & 7) << 18 | (buffer[start++] & 63) << 12 | (buffer[start++] & 63) << 6 | buffer[start++] & 63) - 0x10000;
-            chunk[i++] = 0xD800 + (t >> 10);
-            chunk[i++] = 0xDC00 + (t & 1023);
-        } else
-            chunk[i++] = (t & 15) << 12 | (buffer[start++] & 63) << 6 | buffer[start++] & 63;
-        if (i > 8191) {
-            (parts || (parts = [])).push(String.fromCharCode.apply(String, chunk));
-            i = 0;
-        }
     }
-    if (parts) {
-        if (i)
-            parts.push(String.fromCharCode.apply(String, chunk.slice(0, i)));
-        return parts.join("");
+
+    var str = "";
+    for (var i = start; i < end;) {
+        var t = buffer[i++];
+        if (t <= 0x7F) {
+            str += String.fromCharCode(t);
+        } else if (t >= 0xC0 && t < 0xE0) {
+            str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
+        } else if (t >= 0xE0 && t < 0xF0) {
+            str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
+        } else if (t >= 0xF0) {
+            var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
+            str += String.fromCharCode(0xD800 + (t2 >> 10));
+            str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
+        }
     }
-    return String.fromCharCode.apply(String, chunk.slice(0, i));
+
+    return str;
 };
 
 /**
```
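If you need to keep this workaround applied across installs, one option (not from this thread) is patch-package, which would store the diff as `patches/@protobufjs+utf8+1.1.0.patch` and reapply it from a postinstall script; the filename assumes the v1.1.0 package mentioned above. Whatever mechanism you use, a quick check that the patched file is actually the one being loaded is to round-trip a boundary-crossing string like the one in the repro sketch earlier in the thread:

```js
// Sanity check after patching @protobufjs/utf8: a string whose surrogate pair
// straddles the old 8191-entry chunk boundary should now round-trip exactly.
var utf8 = require("@protobufjs/utf8");

var s = "a".repeat(8191) + "\u{1F600}" + "b".repeat(8192);
var buf = Buffer.alloc(utf8.length(s));
utf8.write(s, buf, 0);

console.log(utf8.read(buf, 0, buf.length) === s); // true with the patch, false without
```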
protobuf.js version: 6.10.1
The `utf8.read` function seems to insert extra Unicode characters sometimes. Here's a repro (https://repl.it/@masfrost/pbjs-bad-decode) where I check `utf8.read` against the WHATWG `TextEncoder`. The repro file is not very minimal, but this issue seems to be pretty common for us when decoding strings.

FYI for future readers, we monkey patched the library and forced it to use `TextDecoder`/`TextEncoder` here: https://github.com/replit/crosis/blob/v5.0.3/src/fixUtf8.ts
I think using the standard `TextEncoder`/`TextDecoder` might be the best thing to do here; encoding is just too complicated, and I'm sure these standard implementations are faster. Happy to put up a PR if that's an option; otherwise, I don't really have enough time to go spelunking into utf8 land.
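For anyone who wants the `TextDecoder`/`TextEncoder` route without pulling in the linked fixUtf8.ts, below is a minimal sketch of one way to monkey patch string handling in protobuf.js 6.10.x; it is an illustration, not the exact approach used in that file. It overrides `string` on the base `Reader`/`Writer` prototypes, which is what browser builds use; note that under Node, protobuf.js may route through the Buffer-backed `BufferReader`/`BufferWriter` subclasses, whose own `string` implementations already use Buffer's native UTF-8 codec, so check which classes your runtime actually exercises.

```js
// Sketch: route protobuf.js string fields through the platform's WHATWG
// encoder/decoder instead of the bundled utf8 helpers. Assumes protobufjs
// 6.10.x and a runtime with global TextEncoder/TextDecoder (browsers, Node 11+).
var protobuf = require("protobufjs/minimal");

var textDecoder = new TextDecoder("utf-8");
var textEncoder = new TextEncoder();

// Reader.prototype.bytes() consumes the length-delimited payload and returns a
// Uint8Array, so decoding it directly bypasses the custom utf8.read path.
protobuf.Reader.prototype.string = function string() {
  return textDecoder.decode(this.bytes());
};

// Writer.prototype.bytes() writes the uint32 length prefix followed by the raw
// bytes, which is exactly the wire encoding of a string field.
protobuf.Writer.prototype.string = function string(value) {
  return this.bytes(textEncoder.encode(value));
};
```

Apply the patch once at startup, before any messages are encoded or decoded.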