
Why should strings be lists of Unicode Scalar Values? #135

@lukewagner

This issue lays out the reasoning for why I think strings should be lists of Unicode Scalar Values (as currently written in the explainer). This is a fairly nuanced question with the reasoning currently scattered around a number of issues, repos and specs, so I thought it would be useful to collect it all into one focused issue for discussion. The issue reflects discussions with a bunch of folks recently and over the years (@annevk, @hsivonen, @sunfishcode, @fgmccabe, @tschneidereit, @domenic), so I won’t claim credit for the reasoning. Also, to be clear, this issue only answers half of the overall question about string encoding, but I think it’s the first question we have to answer before we can meaningfully talk about string encodings.

(Note: I intend to update the OP in-place if there are any inaccuracies so that it represents a coherent argument.)

First, a bit of context:

Current proposal

As background, the Unicode Standard provides two relevant definitions:

  • Code Point: an integer referring to the Unicode codespace in the range [0, 0x10FFFF].
  • Unicode Scalar Value (USV): a Code Point other than a surrogate, and thus an integer in one of the ranges [0, 0xD7FF] or [0xE000, 0x10FFFF].

Based on these definitions, the current explainer proposes:

  • The char interface type is a USV
  • The string interface type is an abbreviation for list char

Thus, string, as currently proposed, contains no surrogates (not just no lone surrogates). For reference: a pair of surrogate Code Units in a valid UTF-16 string is decoded into a single USV and thus valid UTF-16-encoded strings will never decode to strings containing any surrogates.
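To make the “no surrogates after decoding” point concrete, here is a small sketch (Rust is used purely for illustration since its char type is already a USV; this code is not part of the proposal):

```rust
// A surrogate pair in valid UTF-16 decodes to a single USV, so decoding a
// valid UTF-16 string can never yield a surrogate. 0xD83D 0xDE00 is the
// UTF-16 encoding of U+1F600.
fn main() {
    let utf16 = [0xD83Du16, 0xDE00];
    let decoded: Vec<char> = char::decode_utf16(utf16.iter().copied())
        .map(|r| r.expect("valid UTF-16 never yields a lone surrogate"))
        .collect();
    assert_eq!(decoded, vec!['\u{1F600}']); // one USV, no surrogates
}
```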

This is not an encoding or in-memory-representation question

The question of whether strings are lists of Unicode Scalar Values is not a question of encoding or memory representation; rather, it’s a question of: “what are the abstract string values produced by decoding and consumed by encoding?”. Without precisely defining what the set of possible abstract string values is, we can’t even begin to discuss string encoding/decoding since we don’t even know what it is we’re trying to encode or decode. This is especially true in the context of Interface Types, where our goal is to support (via adapter functions) fully programmable encoding/decoding in the future.

Thus, if we’re talking about the abstract strings represented by languages like Java, JS and C#, we’re not talking about “WTF-16” (which is an encoding); we’re talking about “lists of code points not containing surrogate pairs (but potentially containing lone surrogates)”, which for brevity I’ll call Wobbly strings, since Wobbly strings are what a Java/JS/C# string can be faithfully decoded into and encoded from. In particular, a Wobbly string can be encoded by either WTF-8 or WTF-16. Note that the set of Wobbly strings is subtly different and smaller than “lists of Code Points” because surrogate pairs decode into necessarily-non-surrogate code points, so there is no way for a Java/JS/C# string to decode into a surrogate pair. The only major languages I know of whose abstract strings are actually “lists of Code Points” are Python 3 and Haskell.
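To illustrate the difference (again a Rust sketch, chosen only because Rust’s char is a USV, not because this is the prescribed mapping): a Wobbly string may contain a lone surrogate code unit, and there is no USV for it to decode to:

```rust
fn main() {
    // "H", a lone high surrogate, "i": valid WTF-16, but not valid UTF-16.
    let wobbly = [0x0048u16, 0xD800, 0x0069];
    for unit in char::decode_utf16(wobbly.iter().copied()) {
        match unit {
            Ok(c) => println!("USV: {c:?}"),
            Err(e) => println!("lone surrogate 0x{:X}: no USV exists", e.unpaired_surrogate()),
        }
    }
}
```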

This is a Component Model question

As of our recent CG-05-25 polls, the Interface Types proposal now has the goals and requirements of the Component Model (as presented and summarized). Concretely, this means we’re explicitly concerned with cross-language/toolchain composition, virtualizability and embeddability, which means we’re very much concerned with whether interfaces using string will be consumable and implementable by a wide variety of languages and hosts with robust, portable behavior. Thus, use cases exclusively focused on particular combinations of languages+hosts may need to be solved by separate proposals targeting those specific languages+hosts if they are in conflict with the explicit goals of broad language/host interoperability.

With all this context in place, I’ll finally get to the reasons for defining string to be a list of USVs:

Reason 1: many languages have no good way to consume surrogates

I think there are a few categories of affected languages (this is based on brief spelunking, so let me know if I got this wrong and I’ll update it):

First, there are languages that simply mandate UTF-8 for their built-in string type, in some cases exposing UTF-8 representation details directly in their string operations. The popular languages I found in this category are: Elixir, Julia, Rust and Swift.

Second, there are languages which define strings as “arbitrary arrays of bytes”, leaving the interpretation up to the library functions that operate on them. For the languages in this category that I looked into, the default encoding (for source text and string literals and sometimes built-in syntax like iteration) is increasingly implicitly assumed to be UTF-8 (due to the fact that, as detailed below, most I/O data is UTF-8). While it may seem like these languages have the most flexibility (and thus ability to accommodate surrogates), when porting existing code, the implicit dependency on UTF-8 (in the form of calls to UTF-8-assuming library functions scattered around the codebase) makes targeting anything other than UTF-8 challenging. The popular languages I found in this category are: C/C++, Go, Lua, PHP and Zig.

Third, there are languages that support a variety of encodings and conversion between them, but still disallow surrogates (among other reasons being that they aren’t generally transcodable). The popular languages I found in this category are: R and Ruby.

In all of these categories, the author of the toolchain that is binding the language to the Interface Types string has no great general option for what to do when given a surrogate:

  • Make incoming surrogates trap. This approach is attractive as it simply makes surrogates “someone else’s fault”, and thus not a corner case that all code in the language’s ecosystem has to worry about. This is an easy answer to pick; however, it would make these languages second-class in the component ecosystem because they wouldn’t be able to implement the same APIs (this option and the next are sketched in code below).
  • Replace incoming surrogates with the replacement character. This happens by default in many places in many of the above languages that I saw, so it’s also a reasonable default option that avoids putting any burden on the language ecosystem at large. But, as with the previous option, this would make these languages second-class as they wouldn’t be able to faithfully implement the same APIs as other languages.
  • Produce non-UTF-8 byte strings. This isn’t possible for languages in the first and third categories and risky for languages in the second, due to the increasingly prevalent implicit assumption of UTF-8 noted above. Moreover, unlike the above two options, this is not a “spot fix”: it requires all ported code to use the appropriate non-UTF-8 string operations.
  • Escape surrogates into valid strings. This could make various simple round-tripping use cases Just Work, without hitting the above snags, but this option implicitly introduces a new micro-format that will need to be supported by any non-trivial string operation that works with the contents of the string (e.g., file system operations), so it’s also not a “spot fix”; it needs ecosystem adoption. Also, escaping can introduce collisions (leading to data corruption) with pre-existing strings since there are no code point sequences reserved for this purpose.
  • Produce a string of a new type that is not the language’s standard/built-in string. This option either requires large-scale changes (converting whole codebases to use the new, non-standard string), which blocks porting use cases, or requires a coercion into the standard string at some later point, which means picking one of the above options.

For any particular use case, one of these options may be obvious. However, toolchains have to handle the general case, providing good defaults. In addition to the variable ecosystem cost of the different options, there is also a fixed non-negligible cost in wasted time for the N teams working on the N language toolchains, each of which will have to page in this whole problem space and wade through the space of options. In contrast, with a list of USVs, all the above languages can just do the obvious thing they’re already doing.
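As a rough sketch of how the first two options above look in practice (using Rust’s standard library as a stand-in for any of the UTF-8 languages; which exact functions a given toolchain would use is an assumption here):

```rust
fn main() {
    let with_lone_surrogate = [0x0041u16, 0xD800, 0x0042];

    // “Trap”: strict decoding fails when it hits the lone surrogate.
    assert!(String::from_utf16(&with_lone_surrogate).is_err());

    // “Replacement character”: the lone surrogate becomes U+FFFD.
    let lossy = String::from_utf16_lossy(&with_lone_surrogate);
    assert_eq!(lossy, "A\u{FFFD}B");
}
```

Either way, the surrogate does not survive the boundary; the only question is whether the failure is loud or silent.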

Reason 2: strings will often need to be serialized over standardized protocols and media formats, which usually disallow surrogates

A common use of Interface Types will be to describe I/O APIs (e.g., for passing data over networks or reading/writing different media formats). Additionally, several of the component model’s virtualizability use cases involve mocking non-I/O APIs in terms of I/O (e.g., turning a normal import call into an RPC, logging call parameters and results, etc.). In both of these cases, surrogates are in direct conflict with the binary formats of most existing standard network protocols and standard media formats.

In particular, just considering Web-relevant character sets:

  • RFC 2277 (IETF Policy on Character Sets and Languages) specifies that “When using other charsets than UTF-8, these MUST be registered in the IANA charset registry, if necessary by registering them when the protocol is published.”, and there are no IANA charsets that include surrogates.
  • The W3C Architectural Specification specifically calls out “Specifications MUST NOT allow the use of surrogate code points.”
  • The preface of the WHATWG Encoding Standard says “The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.”
  • If a format is to support line-breaking or collation, the Unicode specification says the behavior given surrogates is undefined, possibly resulting in an error.
  • All XML-based formats reject surrogates.
  • Popular RPC protocols such as Protobufs and CapnProto mandate UTF-8 for strings/text.
  • GraphQL strings mandate UTF-8.

On the Web, new APIs and formats created over the last 10 years simply mandate UTF-8, including:

  • RFC 8259, for all JSON documents that aren’t shared as part of a closed ecosystem.
  • JS files loaded in newer Worker and ES Module contexts.
  • The WebSockets text stream APIs.
  • The json and text getter functions of fetch, XHR and Blob APIs.

There’s also a recent proposal to make this direction more-officially part of the W3C’s design principles.

Thus, insofar as a string needs to transit over any of these protocols, formats or APIs, surrogates will be a problem and the implementer of the mapping will have roughly the same bad options listed above as the language toolchains have.

While it’s tempting to say “that’s just a specific precondition of particular APIs, not the string type’s problem”, the virtualization goals of the component model mean that any interface might be virtualized, so the fact that a string is being used for one of the above is not a detail of the API. In contrast, all these protocols and formats can easily represent lists of USVs.
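As a minimal sketch of that contrast (Rust again standing in for any UTF-8-based serializer): a list of USVs always has a UTF-8 serialization, whereas there is no UTF-8 byte sequence for a surrogate at all:

```rust
fn main() {
    // Any list of USVs can be encoded to UTF-8 bytes, infallibly.
    let usvs: Vec<char> = vec!['H', '\u{E9}', '\u{1F600}'];
    let utf8: String = usvs.into_iter().collect();
    assert_eq!(utf8.as_bytes().len(), 7); // 1 + 2 + 4 bytes
}
```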

Reason 3: even the WTF-16 languages will have a bad time if they actually try to pass surrogates across a component boundary

Because of the above two reasons, from the perspective of a WTF-16-language-implemented component, it is a very risky proposition to pass a surrogate across a component boundary (as the parameter of an import or the result of an export). Why? Because there’s no telling whether the other side will trap, convert the surrogate into a replacement character, mangle the string, or trigger undefined/untested behavior. A component author also can’t assume a fixed set of clients or hosts (that’s the point of components).

Thus, to produce widely-reusable and portable components, even a toolchain for a language that allows lone surrogates would be advised to conservatively scrub these before passing strings to the outside world. In a sense, this is nothing new on the Web: despite JSON being derived from JS, JSON doesn’t allow surrogates while JS does, thus there is an inherent scrubbing process that happens when JS communicates with the outside world via JSON (and similarly with WebSockets, fetch(), etc). Accordingly, the WTF-8 spec specifically advises against its ever being used outside of “self-contained systems”.

As an illustrative example: consider instead defining string to be a list of Code Points. As explained above, this would mean string was a superset of the Wobbly strings supported by Java/JS/C#. Why might we do this? For one thing, it would capture the full expressive range of Python 3 and Haskell strings and APIs (which is the same argument as for supporting Wobbly strings, just for a smaller set of languages). For another, it would give us a simple definition of char (= Code Point) and string (= list char), which has a number of practical benefits (in contrast to Wobbly strings, which cannot be a “list char” for any definition of “char”). However, the vast majority of languages and hosts would then have to resort to a variant of the abovementioned workarounds, which means Python 3 and Haskell would have a Bad Time if they actually tried to take advantage of this increased string expressivity. Thus, there would be a distributed cost without a commensurate distributed benefit. I think the situation is the same with Wobbly strings, even if the partitioning of languages is different.
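For a purely illustrative view of why the workarounds reappear: in a language whose native character type is a USV (Rust shown here), a hypothetical char-as-Code-Point has no value for a surrogate code point to map onto:

```rust
fn main() {
    assert!(char::from_u32(0x10FFFF).is_some()); // the highest USV is representable
    assert!(char::from_u32(0xD800).is_none());   // a surrogate code point is not
}
```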

What about binary data in strings?

One potential argument for surrogates is that they may be necessary to capture arbitrary binary data, particularly on the Web. To speak to this concern, it’s important to first clarify something: Web IDL has a ByteString type that is used for APIs (predominantly HTTP header methods like Headers.get()), where a ByteString is intentionally an arbitrary array of bytes. However, ByteString does this not by interpreting a JS string as a raw array of uint16s (which would have a problem representing byte strings of odd length), but by requiring each JS string element (a uint16 value) to be in the range [0, 255], throwing an exception otherwise. Since surrogates are outside the range [0, 255], this means that in the one place in the Web Platform where binary data is actually appropriate, surrogates are irrelevant.
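A minimal sketch of the ByteString conversion described above (the helper name to_byte_string is hypothetical, not a real Web IDL or Rust API; the check itself mirrors the [0, 255] requirement):

```rust
// Each 16-bit code unit of the JS string must fit in a byte; anything above
// 0xFF, which includes every surrogate (0xD800..=0xDFFF), is rejected.
fn to_byte_string(code_units: &[u16]) -> Result<Vec<u8>, String> {
    code_units
        .iter()
        .map(|&u| u8::try_from(u).map_err(|_| format!("code unit 0x{u:X} is > 0xFF")))
        .collect()
}

fn main() {
    assert_eq!(to_byte_string(&[0x48, 0x69]), Ok(vec![0x48, 0x69]));
    assert!(to_byte_string(&[0xD800]).is_err()); // a surrogate can never appear
}
```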

Outside ByteString use cases, there’s still a theoretical possibility of wanting to round-trip binary data through DOMString APIs. Talking to folks who have worked for years on the Web IDL and Encoding specs (@annevk, @hsivonen, @domenic), they’re not aware of any valid use cases for such usage of DOMString. Indeed, the TextDecoder API does not provide any way to produce a non-USVString, due to this same lack of use cases. In fact, there is currently no direct way (i.e., not involving String.fromCharCode et al) to decode an array of bytes into a non-USVString on the Web Platform.

Instead, the natural way for a component to pass binary data is a list u8 or list u16, with JS glue code converting the byte array into a JS string where needed. If such use cases were found, and found to be on performance-sensitive paths in real workloads on the Web, then a Web-specific solution would seem appropriate, and I can think of a number of options for how to optimize this path by adding things to the JS API. But ultimately, as an optimization, I don’t think this is something we should preemptively add without compelling data.
