Support UTF-16 as an additional encoding #136
First, a general comment on a recent realization of why it's a good idea to have more than one string encoding in the MVP's canonical ABI: the current plan of record is to start with only the canonical ABI and add custom ABI support via adapter functions after that. Once we get to the latter, instructions like …

The worry, though, is that, since the actually-reallocating path isn't exercised by the MVP, it will be broken in practice. E.g., in the MVP, you could get away with …

Regarding the actual design of supporting multiple encodings: it's already the case that the canonical ABI is intended to be parameterized by linear-memory-vs-gc-memory. This WASI presentation slide gives a concrete example of the same Interface Typed signature with the options …

Next, the canonical ABI shows up in two ways:
Thus, it's the same (parameterized) canonical ABI, just in two different contexts.

Lastly, there's the question of what encodings to actually support. If our goal here is to optimize for UTF-16 languages, then I think it's important to realize that most production VMs use a dual UTF-16/Latin-1 representation (sometimes called "compact strings"). Moreover, the UTF-16/Latin-1 choice is made on a per-string basis, so even adding a …
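To make the per-string UTF-16/Latin-1 choice concrete, here is a minimal sketch of the "compact strings" idea. The function name `compact_encode` is made up for illustration; real engines make this choice on their internal representation rather than via codecs:

```python
def compact_encode(s: str):
    """Pick the narrower representation per string, in the spirit of
    'compact strings': Latin-1 when every code point fits in one byte,
    full UTF-16 code units otherwise."""
    if all(ord(c) <= 0xFF for c in s):
        return ("latin-1", s.encode("latin-1"))    # 1 byte per code unit
    return ("utf-16-le", s.encode("utf-16-le"))    # 2 bytes per code unit
```

A string like "héllo" stays in the one-byte form, while a single code point above U+00FF forces the whole string into the two-byte form.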
Thus, altogether, I think 3 canonical ABI parameter values make sense for …

It is worth asking whether all this additional complexity in the canonical ABI is worth it, and I've had different opinions on this over time, but I think the "test …" |
First, +1 to UTF-16 support! I am slightly worried that a WTF-16-encoded binary string might get inadvertently corrupted if toolchains default to passing it around via a UTF-16 interface type that performs silent replacement of unpaired surrogates. Does it make sense to have the canonical lift function explicitly trap if an unpaired surrogate is encountered, or would that be too unfriendly? I remember this was already discussed for UTF-8, but the balance seems slightly different here since WTF-16 binary strings are more of a thing. I admit this is a fringe concern, since a careful toolchain setup can expose such strings as …

EDIT: is the plan to allow (implicit?) conversion between at least … |
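The replace-vs-trap trade-off can be shown directly. Below is a hedged sketch (names like `lift_utf16` are hypothetical, not spec terminology) of a canonical lift over 16-bit code units, with both behaviors for an unpaired surrogate:

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

def lift_utf16(units, on_unpaired="replace"):
    """Decode a list of 16-bit code units into a string. Unpaired
    surrogates are either replaced with U+FFFD (lossy) or raise
    an error (the 'trap' behavior discussed above)."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # well-formed surrogate pair -> one supplementary code point
            out.append(chr(0x10000 + ((u - 0xD800) << 10)
                           + (units[i + 1] - 0xDC00)))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            if on_unpaired == "trap":
                raise ValueError(f"unpaired surrogate 0x{u:04X}")
            out.append(chr(REPLACEMENT))  # silent, lossy replacement
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)
```

A WTF-16 "binary string" containing a lone 0xD800 comes out of the replacement lift as U+FFFD, which is exactly the silent corruption the comment worries about.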
+1 from me as well, sounds very good! If a lossless alternative cannot find consensus, at AssemblyScript we would very likely fall back to canonical UTF-16 while documenting what can go wrong at Interface Types boundaries. We would prefer silent replacement so whole applications don't accidentally break. Compact strings are a nice addition that I could imagine exploring as well. |
I can definitely understand the desire to catch bugs, but implicit replacement of surrogates in WTF-16 content already appears to be the default experience when externalizing a WTF-16 string, especially on the Web, but also in a number of other cases I noticed while investigating other languages' transcoding paths. |
A couple of comments:
So I question that utf16 modes have the practical benefit that some folks here seem to assume. Their addition would be biased towards specific assumptions about JS implementations that do not reflect the common case. And as such, they only raise completely wrong expectations. (I understand the realloc argument, but it seems a bit odd to argue for the addition of a feature B on the sole basis that it enforces debugging of a feature A. (I believe there is a phrase for this kind of rationale, but I can't remember it right now :) ) |
@rossberg I'd argue that the "canonical" nature of the string ABI comes from the fact that, at the boundary between components, the …

I'm somewhat more lukewarm on … |
I appreciate the excursion into complex string representations and non-standard optimizations that some VMs do. I think it's a little one-sided / too early to only look at the complex VMs of today, though, in that languages we want to compile to Wasm have independent requirements, say avoiding their half of the re-encoding on the core module side before feeding into an adapter, which is unnecessary code and work, or generally keeping code bloat / runtime overhead low. Modules are frequently shipped over the wire, while VMs are not. As such, what is suggested here does help Wasm languages, while VMs can of course still optimize and adapt however they find appropriate. This can freely change anyhow.

And who knows, perhaps one day someone will ask for ropes or slices (wouldn't slices already work?), could well be, but I haven't seen anyone asking for it yet. Apart from that, I think what's basically encoders/decoders for UTF-8/16/Latin-1 is an obvious start regardless.
Btw, I would have preferred a clarifying question instead :) |
I think that's a misunderstanding. Luke's presentation clearly talked about limiting the MVP to "canonical adapter functions", which implicitly define a "canonical ABI". The set of types did not change.
This would likely only help inside JS embeddings of Wasm, as most other host environments, including browsers themselves, predominantly use UTF-8. And for JS embeddings it would only help in one direction, going from X-compiled-to-Wasm to JS; for the inverse direction, it won't buy much in contemporary JS engines. And even when going to JS it is only significant assuming X does not itself have a smarter string representation; e.g., Java VMs like HotSpot also default to single-byte string representations, so that UTF-16 will be rare. Additionally, this all assumes that UTF re-encoding is significantly more expensive than UTF validation + copying. Do we have evidence for that? So the use case seems rather narrow, and the benefit unclear.
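The cost question raised here can at least be stated precisely: for a UTF-8 source string, the two candidate paths are (a) transcode to UTF-16 versus (b) validate the UTF-8 bytes and copy them through unchanged. A minimal sketch of both paths follows (settling the performance question would still require measuring them, e.g. with `timeit`; this only demonstrates the operations being compared):

```python
def transcode_to_utf16(utf8_bytes: bytes) -> bytes:
    # path (a): full re-encoding, UTF-8 -> UTF-16LE
    return utf8_bytes.decode("utf-8").encode("utf-16-le")

def validate_and_copy(utf8_bytes: bytes) -> bytes:
    # path (b): validation (decode raises on malformed input) + byte copy
    utf8_bytes.decode("utf-8")
    return bytes(utf8_bytes)
```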
Things like ropes are fairly established implementation techniques nowadays, not just in JS, as O(1) string concatenation is typically expected in scripting and other high-level languages.
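For reference, the rope technique mentioned above, O(1) concatenation by allocating a tree node instead of copying, looks roughly like this deliberately minimal sketch (real implementations add balancing, lengths, and in-place flattening):

```python
class Rope:
    """Minimal rope: concat is O(1) (one node allocation); the copying
    cost is deferred to flatten(), which is O(n) over all leaves."""

    def __init__(self, left=None, right=None, leaf=""):
        self.left, self.right, self.leaf = left, right, leaf

    @staticmethod
    def of(s: str) -> "Rope":
        return Rope(leaf=s)

    def concat(self, other: "Rope") -> "Rope":
        return Rope(left=self, right=other)  # no characters copied here

    def flatten(self) -> str:
        if self.left is None:
            return self.leaf
        return self.left.flatten() + self.right.flatten()
```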
Don't forget that we are talking about an "MVP". This obviously isn't minimal, and ITs are perfectly viable without. For the MVP, I think it's good advice to be extra wary of scope creep, bias, and premature optimisation. |
On the meaning of "canonical ABI": we can discuss whether the word "canonical" is the right one, but the defining characteristic here is that the lifting/lowering scheme (between core wasm and abstract interface-typed values) is baked into the engine, not programmable via adapter functions. This both avoids all the novel problems of how to do adapter functions and also simplifies the way this whole thing looks from the POV of a traditional toolchain.
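Mechanically, the engine-defined lowering of a string amounts to: encode, call a guest-exported `realloc` to obtain space in linear memory, copy the bytes in, and hand core wasm a (pointer, length) pair. A hypothetical host-side sketch follows; the names and the `realloc(old_ptr, old_size, align, new_size)` signature are illustrative, not quoted from the spec:

```python
def canon_lower_string(s, memory: bytearray, realloc, encoding="utf-8"):
    """Engine-side lowering of a string into guest linear memory."""
    data = s.encode(encoding)
    ptr = realloc(0, 0, 1, len(data))     # ask the guest for space
    memory[ptr:ptr + len(data)] = data    # copy the encoded bytes in
    return ptr, len(data)                 # core wasm sees (i32, i32)

# Toy guest: 64 bytes of "linear memory" and a bump allocator.
memory = bytearray(64)
_next = [8]
def bump_realloc(old_ptr, old_size, align, new_size):
    ptr = _next[0]
    _next[0] += new_size
    return ptr

ptr, length = canon_lower_string("hi", memory, bump_realloc)
```

The point of the sketch is that none of this is programmable by the module: the whole sequence is fixed by the engine, parameterized only by options such as the encoding.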
From multiple points of view (spec, engine impl, toolchain impl): the addition of UTF-16 to the canonical ABI still maintains all the high-order-bit simplifications and should be a fairly modest delta in effort. (We can have a more concrete experience report on this in a few months.)
Yes, but on: …

…and that's the only place where Interface Types exist, so I think …
IIUC, … |
From WebAssembly/design#1419 (comment):
I am very interested in making this happen, as it would already be a considerable improvement for languages using a 16-bit Unicode representation. What I could currently imagine is having either separate instructions, an immediate (but then it may as well be separate instructions, I guess), or a parameter. For example: …
Is that what you had in mind? If not, I am of course very interested in the other options :)
It may also be worthwhile to consider `list.lift_latin1`, which corresponds to narrow UTF-16 (with the high zero bytes left out), as it is a common optimization strategy in UTF-16 languages (to save memory and better utilize the CPU cache when possible). I do not feel strongly about whether or not we need the latter in an MVP already, though.
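The "narrow UTF-16" reading of a Latin-1 lift can be shown directly: each byte is one code unit, identical to the full UTF-16 form with its zero high bytes removed. A small sketch, where `lift_latin1` merely illustrates the semantics such an instruction might have (it is not a defined instruction):

```python
def lift_latin1(data: bytes) -> str:
    # one byte per code unit; code points are in 0x00..0xFF
    return "".join(chr(b) for b in data)

def widen_to_utf16le(data: bytes) -> bytes:
    # the same text as full UTF-16LE: re-insert the zero high bytes
    return bytes(x for b in data for x in (b, 0))
```

Widening then decoding as UTF-16LE yields the same string as the direct Latin-1 lift, which is the sense in which the two encodings coincide for this subset.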