
Make decode() and encodeInto() accept SharedArrayBuffer-backed views #172

Closed

@juj

Description

In conjunction with WebAssembly and multithreading, there is a need for a new TextDecoder.decode() API for converting e.g. UTF-8 encoded strings in a typed array to a JavaScript string.

Currently, to convert a string in the WebAssembly heap to a JS string, one can do

// init:
var textDecoder = new TextDecoder("utf8");
var wasmHeap = new Uint8Array(...); // coming from Wasm instantiation

// use:
var pointerToUtf8EncodedStringInHeap = 0x421341; // A UTF-8 encoded C string residing on the heap
var stringNullByteIndex = pointerToUtf8EncodedStringInHeap;
while(wasmHeap[stringNullByteIndex] != 0) ++stringNullByteIndex;
var jsString = textDecoder.decode(wasmHeap.subarray(pointerToUtf8EncodedStringInHeap, stringNullByteIndex));

There are three shortcomings in this API that are bugging Emscripten/asm.js/WebAssembly users:

  1. TextDecoder.decode() does not work with SharedArrayBuffer, so the above fails if wasmHeap views a SharedArrayBuffer in a multithreaded WebAssembly program.

  2. TextDecoder.decode() takes a typed array view, and it always converts the whole view. As a result, one has to call wasmHeap.subarray() on the large wasm heap to generate a small view that only encompasses the portion of memory that contains the string. This generates temporary garbage that would be needless with a more appropriate API.

  3. The semantics of TextDecoder.decode() are to always convert the whole input view that is passed to the function. This means that if a null byte \0 occurs in the middle of the view, the generated JS string will contain a null UTF-16 code point at that position; decoding does not stop at the first null byte, but continues past it. The effect is that in order to use the API from a WebAssembly program dealing with null-terminated C strings, JavaScript or Wasm code must first scan the whole string to find the first null byte. This is harmful for performance when dealing with long strings. It would be better to have an API where the specified decode size is a maxBytesToRead style of size instead of an exact size. That way JS/WebAssembly would not need to pre-scan each string to find out how long it actually is, improving performance.
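Points 2 and 3 combined look like this in practice; a minimal runnable sketch, where the heap contents and offset are made up for illustration:

```javascript
// Sketch of the decode pattern described above, with the costs from
// points 2 and 3 marked. In real use, wasmHeap would view Wasm memory.
function utf8ToString(heap, ptr) {
  // Point 3: an O(N) scan for the terminating null byte is needed
  // before decode() can even be called.
  var end = ptr;
  while (heap[end] !== 0) ++end;
  // Point 2: a temporary subarray view is created just to bound the
  // decode, generating garbage.
  return new TextDecoder("utf-8").decode(heap.subarray(ptr, end));
}

// Simulate a heap that holds the C string "hello\0" at offset 4:
var heap = new Uint8Array(16);
heap.set(new TextEncoder().encode("hello"), 4);
console.log(utf8ToString(heap, 4)); // "hello"
```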

This kind of code often occurs in compiled C programs, which already provide max sizes for their buffers to guard against buffer overflows in C code. That is,

char str[256] = ...;
UTF8ToString(str, sizeof(str)); // Convert C string to JS string, but provide a max cap for the buffer that cannot be exceeded

or

size_t len = 256;
char *str = malloc(len);
UTF8ToString(str, len);

Having to do an O(N) scan just to figure out a buffer overflow guard bound would not be ideal.
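The maxBytesToRead behaviour argued for above can be emulated in userland today, though it still pays the temporary-view cost from point 2; a hedged sketch, with a made-up helper name:

```javascript
// Userland emulation of the proposed maxBytesToRead semantics.
// utf8ToStringBounded is an illustrative name, not a real API; it
// still allocates a temporary subarray, which the proposal avoids.
function utf8ToStringBounded(heap, ptr, maxBytesToRead) {
  var end = ptr;
  var limit = ptr + maxBytesToRead;
  // Stop at the first null byte OR at the caller-provided cap,
  // mirroring the C-side buffer-overflow guard.
  while (end < limit && heap[end] !== 0) ++end;
  return new TextDecoder("utf-8").decode(heap.subarray(ptr, end));
}

var heap = new Uint8Array(16);
heap.set(new TextEncoder().encode("hi"), 0);
console.log(utf8ToStringBounded(heap, 0, 16)); // "hi"
console.log(utf8ToStringBounded(heap, 0, 1));  // "h" (capped by the bound)
```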

It would be good to have a new function on TextDecoder, e.g.

TextDecoder.decodeRange(ArrayBuffer|SharedArrayBuffer|TypedArrayView, startIndex, [optional: maxElementsToRead]);

which would allow reading from SharedArrayBuffers, take a startIndex into the array, and optionally accept a maximum number of elements to read. This parallels what was done in WebGL 2 with the advent of WebAssembly and multithreading: all entry points in WebGL 2 dealing with typed arrays gained a new variant of the function that takes in SharedArrayBuffers and does not produce temporary garbage: https://www.khronos.org/registry/webgl/specs/latest/2.0/#3.7

Also, in the case of WebGL, all API entry points were retroactively re-specced to allow SharedArrayBuffers and shared array views in addition to regular typed arrays and views. It would be nice for that to happen with decode() as well, although if so, it should probably happen at exactly the same time as the new decodeRange() function is added, so that code can feature test via the presence of decodeRange() whether the old decode() function has been improved or not.
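That feature test could look like the following; decodeRange is the hypothetical name proposed in this issue, not an existing API, so today this evaluates to false everywhere:

```javascript
// Feature test sketched from the paragraph above: if the hypothetical
// decodeRange() exists, code could assume decode() has also been
// re-specced to accept SharedArrayBuffer-backed views.
var decoder = new TextDecoder("utf-8");
var decodeHandlesShared = typeof decoder.decodeRange === "function";
console.log(decodeHandlesShared);
```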

With such a new decodeRange() function, the JS code at the very top would transform to

// init:
var textDecoder = new TextDecoder("utf8");
var wasmHeap = new Uint8Array(...); // coming from Wasm instantiation

// use:
var pointerToUtf8EncodedStringInHeap = 0x421341; // A UTF-8 encoded C string residing on the heap
var jsString = textDecoder.decodeRange(wasmHeap, pointerToUtf8EncodedStringInHeap);

which would work with multithreading, improve performance, and reduce code size.

The reason this is somewhat important is that with Emscripten and WebAssembly, marshalling strings across the wasm/JS language boundary is really common: most Emscripten-compiled applications do it. In the absence of multithreading-capable text marshalling, applications that need string marshalling have to resort to a manual JS-side implementation that loops and appends String.fromCharCode() characters one by one:

https://github.com/emscripten-core/emscripten/blob/c2b3c49f71ab98fbd9ff829d6cbd30445b56a93e/src/runtime_strings.js#L98
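As a rough illustration of that fallback, here is an ASCII-only reduction of the linked routine; the real runtime_strings.js implementation also decodes multi-byte UTF-8 sequences and batches its String.fromCharCode calls:

```javascript
// Simplified, ASCII-only sketch of the manual character-by-character
// fallback linked above. This is the per-character loop that a
// SharedArrayBuffer-capable decode API would make unnecessary.
function asciiToString(heap, ptr) {
  var str = "";
  while (heap[ptr] !== 0) str += String.fromCharCode(heap[ptr++]);
  return str;
}

console.log(asciiToString(new Uint8Array([104, 105, 0]), 0)); // "hi"
```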

It would be good for that code to be able to go away.

CC @kripken, @dschuff, @lars-t-hansen, @binji, @lukewagner, @titzer, @bnjbvr, @aheejin , who have been working on WebAssembly multithreading.
