Emscripten implements its own transcoder from JavaScript strings to UTF-8 and uses it when passing strings to C functions: https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L158. This transcoder logically decodes the stream of JS code units as UTF-16, producing an abstract stream of Unicode scalar values, which it then encodes to UTF-8.
However, if the JS string contains malformed UTF-16, the decoder neither throws an error nor produces the replacement character U+FFFD.
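For comparison (this is just the platform API, not Emscripten's code), the WHATWG TextEncoder replaces unpaired surrogates with U+FFFD (UTF-8 bytes EF BF BD) when producing UTF-8:

```js
new TextEncoder().encode('\uD800');  // Uint8Array [ 0xEF, 0xBF, 0xBD ]
new TextEncoder().encode('\uD800a'); // Uint8Array [ 0xEF, 0xBF, 0xBD, 0x61 ]
```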
If we extract the decoder, it looks like this:
```js
function firstUSV(str) {
  var i = 0;
  var u = str.charCodeAt(i); // possibly a lead surrogate
  if (u >= 0xD800 && u <= 0xDFFF) {
    // No check that u is a lead (rather than trail) surrogate, or that a
    // next code unit exists and is a trail surrogate.
    var u1 = str.charCodeAt(++i);
    // Note: + binds tighter than |, so this is
    // (0x10000 + ((u & 0x3FF) << 10)) | (u1 & 0x3FF).
    u = 0x10000 + ((u & 0x3FF) << 10) | (u1 & 0x3FF);
  }
  if (u > 0x10FFFF)
    throw new Error('Bad codepoint: 0x' + u.toString(16));
  return '0x' + u.toString(16);
}
```
Here are some erroneous examples:
```js
> firstUSV('\uD800')
"0x10000"
> firstUSV('\uD800a')
"0x10061"
> firstUSV('\uDC00')
"0x10000"
> firstUSV('\uDC00\uD800')
"0x10000"
> firstUSV('\uDFFF\uDFFF')
"0x10ffff"
```
It doesn't seem that the 'Bad codepoint' assertion in firstUSV would ever fire for any input, but I am not sure about that.
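If I am reading the arithmetic right (this is my own back-of-the-envelope check, not anything in the Emscripten source), both code units are masked to 10 bits, so the surrogate branch tops out at exactly 0x10FFFF, and the non-surrogate branch returns a raw code unit of at most 0xFFFF:

```js
// Maximum value the surrogate branch can produce: both halves masked to 10 bits.
// (+ binds tighter than |, matching the expression in firstUSV.)
0x10000 + ((0xDFFF & 0x3FF) << 10) | (0xDFFF & 0x3FF); // 0x10FFFF
```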
Related to WebAssembly/gc#145.