Incorrect decoding from JS strings to USV strings #15324

@wingo

Emscripten implements its own transcoder from JavaScript strings to UTF-8, and uses it when passing strings to C functions (see https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L158). This transcoder logically decodes the stream of JS code units as UTF-16, producing an abstract stream of Unicode scalar values, which it then encodes to UTF-8.
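
For instance, string arguments passed via ccall are converted with this transcoder before the call. A minimal illustration (takes_string is a hypothetical C function, and ccall has to be exported to the runtime):

// hypothetical: void takes_string(const char *s);
Module.ccall('takes_string', null, ['string'], ['\uD800 oops']);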

However, if the JS string contains malformed UTF-16 (unpaired surrogates), the decoder neither throws an error nor produces the replacement character U+FFFD.

If we extract the decoder, it looks like this:

function firstUSV(str) {
  var i = 0;
  var u = str.charCodeAt(i); // possibly a lead surrogate
  if (u >= 0xD800 && u <= 0xDFFF) {
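    // taken for trail surrogates as well; the next code unit is consumed
    // without checking that it is actually a trail surrogate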
    var u1 = str.charCodeAt(++i);
    u = 0x10000 + ((u & 0x3FF) << 10) | (u1 & 0x3FF);
  }
    
  if (u > 0x10FFFF)
    throw new Error('Bad codepoint: 0x' + u.toString(16));

  return '0x' + u.toString(16);
}

Here are some examples of the erroneous behavior:

> firstUSV('\uD800')
"0x10000"
> firstUSV('\uD800a')
"0x10061"
> firstUSV('\uDC00')
"0x10000"
> firstUSV('\uDC00\uD800')
"0x10000"
> firstUSV('\uDFFF\uDFFF')
"0x10ffff"

It doesn't seem that the assertion (the throw above) would ever fire for any input: since + binds more tightly than |, the value computed for a surrogate is at most (0x10000 + (0x3FF << 10)) | 0x3FF = 0x10FFFF, and any other code unit is at most 0xFFFF. But I am not sure about that.
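
For what it's worth, a lossy decode in the style of TextEncoder would only combine a lead surrogate with a following trail surrogate, and fall back to U+FFFD otherwise. A minimal sketch (the function name and structure are mine, not a proposed patch):

function firstUSVReplacing(str) {
  var u = str.charCodeAt(0); // possibly a lead surrogate
  if (u >= 0xD800 && u <= 0xDBFF) {
    var u1 = str.charCodeAt(1);
    if (u1 >= 0xDC00 && u1 <= 0xDFFF) {
      // valid surrogate pair
      u = 0x10000 + ((u & 0x3FF) << 10) + (u1 & 0x3FF);
    } else {
      u = 0xFFFD; // unpaired lead surrogate
    }
  } else if (u >= 0xDC00 && u <= 0xDFFF) {
    u = 0xFFFD; // unpaired trail surrogate
  }
  return '0x' + u.toString(16);
}

> firstUSVReplacing('\uD800')
"0xfffd"
> firstUSVReplacing('\uD800\uDC00')
"0x10000"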

Related to WebAssembly/gc#145.
