Incorrect decoding from JS strings to USV strings #15324

@wingo

Emscripten implements its own transcoder from JavaScript strings to UTF-8, and uses it when passing strings to C functions (see https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L158). This transcoder logically decodes the stream of JS code units as UTF-16, producing an abstract stream of Unicode scalar values, which it then encodes to UTF-8.
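
For instance, string arguments passed via ccall are converted with this transcoder before the call. A minimal illustration (takes_string is a hypothetical C function, and ccall has to be exported to the runtime):

// hypothetical: void takes_string(const char *s);
Module.ccall('takes_string', null, ['string'], ['\uD800 oops']);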

However, if the JS string contains malformed UTF-16 (unpaired surrogates), the decoder neither throws an error nor produces the replacement character U+FFFD.

If we extract the decoder, it looks like this:

function firstUSV(str) {
  var i = 0;
  var u = str.charCodeAt(i); // possibly a lead surrogate
  if (u >= 0xD800 && u <= 0xDFFF) {
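    // taken for trail surrogates as well; the next code unit is consumed
    // without checking that it is actually a trail surrogate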
    var u1 = str.charCodeAt(++i);
    u = 0x10000 + ((u & 0x3FF) << 10) | (u1 & 0x3FF);
  }
    
  if (u > 0x10FFFF)
    throw new Error('Bad codepoint: 0x' + u.toString(16));

  return '0x' + u.toString(16);
}

Here are some examples of the erroneous behavior:

> firstUSV('\uD800')
"0x10000"
> firstUSV('\uD800a')
"0x10061"
> firstUSV('\uDC00')
"0x10000"
> firstUSV('\uDC00\uD800')
"0x10000"
> firstUSV('\uDFFF\uDFFF')
"0x10ffff"

It doesn't seem that the assertion (the throw above) would ever fire for any input: since + binds more tightly than |, the value computed for a surrogate is at most (0x10000 + (0x3FF << 10)) | 0x3FF = 0x10FFFF, and any other code unit is at most 0xFFFF. But I am not sure about that.
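
For what it's worth, a lossy decode in the style of TextEncoder would only combine a lead surrogate with a following trail surrogate, and fall back to U+FFFD otherwise. A minimal sketch (the function name and structure are mine, not a proposed patch):

function firstUSVReplacing(str) {
  var u = str.charCodeAt(0); // possibly a lead surrogate
  if (u >= 0xD800 && u <= 0xDBFF) {
    var u1 = str.charCodeAt(1);
    if (u1 >= 0xDC00 && u1 <= 0xDFFF) {
      // valid surrogate pair
      u = 0x10000 + ((u & 0x3FF) << 10) + (u1 & 0x3FF);
    } else {
      u = 0xFFFD; // unpaired lead surrogate
    }
  } else if (u >= 0xDC00 && u <= 0xDFFF) {
    u = 0xFFFD; // unpaired trail surrogate
  }
  return '0x' + u.toString(16);
}

> firstUSVReplacing('\uD800')
"0xfffd"
> firstUSVReplacing('\uD800\uDC00')
"0x10000"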

Related to WebAssembly/gc#145.
