JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value". However, ECMAScript strings can contain sequences of code units that are not well-formed UTF-16 (for example, unpaired surrogates), and QuoteJSONString does not account for this, so any UTF-16 ill-formedness in the input value is carried over to the output.
The effect of this is that the JSON string cannot be correctly converted to another UTF such as UTF-8, so the conversion might fail or involve inserting replacement characters.
Example of the resulting strange behaviour when invoking js (SpiderMonkey) or node:
$ js -e 'print(JSON.stringify("\u{10000}".split("")));' | js -e 'print(JSON.parse(readline()).join("") === "\u{10000}");'
false
$ node -e 'console.log(JSON.stringify("\u{10000}".split("")));' | node -e 'require("readline").createInterface({ input: process.stdin, terminal: false }).on("line", l => { console.log(JSON.parse(l).join("") === "\u{10000}"); });'
false
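The loss in the pipelines above can be reproduced inside a single process. The following is a minimal sketch, assuming a runtime that exposes the `TextEncoder` and `TextDecoder` globals (recent Node.js versions and browsers do). It deliberately avoids calling JSON.stringify, so the result does not depend on how a given engine quotes lone surrogates; it only shows what a UTF-16 to UTF-8 conversion does to the two halves of U+10000.

```javascript
// The two code units of U+10000, as produced by "\u{10000}".split("").
const halves = ["\ud800", "\udc00"];

const encoder = new TextEncoder(); // always encodes to UTF-8
const decoder = new TextDecoder(); // defaults to UTF-8

// Convert a string to UTF-8 bytes and back again.
const roundTrip = (s) => decoder.decode(encoder.encode(s));

// Each lone surrogate has no UTF-8 representation, so the encoder
// substitutes U+FFFD REPLACEMENT CHARACTER for it.
const lossy = halves.map(roundTrip);

console.log(lossy.join("") === "\u{10000}"); // false
console.log(lossy.map((s) => s.codePointAt(0).toString(16))); // [ 'fffd', 'fffd' ]

// The well-formed pair survives the same conversion unchanged.
console.log(roundTrip("\u{10000}") === "\u{10000}"); // true
```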
I propose something similar to the fragment below (not currently written in formal ECMAScript style) to be added to the specification for QuoteJSONString, before the final "Else" that currently exists.
else if C is a low surrogate (C >= 0xDC00 && C < 0xE000),
    concatenate the \uHHHH notation of C to product
else if C is a high surrogate (C >= 0xD800 && C < 0xDC00),
    if it is the last code unit,
        concatenate the \uHHHH notation of C to product
    else,
        let N be the next code unit
        if N is a low surrogate (N >= 0xDC00 && N < 0xE000),
            concatenate C to product
            concatenate N to product
            progress iteration to the next code unit (N)
        else,
            concatenate the \uHHHH notation of C to product
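The fragment above can be sketched in JavaScript to make its behaviour easier to test. This is only an illustration, not specification text: the names `quoteJSONStringWellFormed`, `unicodeEscape`, `isHighSurrogate`, `isLowSurrogate` and `SHORT_ESCAPES` are hypothetical, the surrogate ranges follow Unicode's definitions (high surrogates are 0xD800 to 0xDBFF, low surrogates are 0xDC00 to 0xDFFF), and the use of lowercase hex digits in the \uHHHH notation is an assumption.

```javascript
// Hypothetical helper: the \uHHHH notation of a code unit (lowercase hex assumed).
function unicodeEscape(codeUnit) {
  return "\\u" + codeUnit.toString(16).padStart(4, "0");
}

// Unicode's definitions: high (leading) and low (trailing) surrogates.
const isHighSurrogate = (c) => c >= 0xd800 && c < 0xdc00;
const isLowSurrogate = (c) => c >= 0xdc00 && c < 0xe000;

// Code units that JSON quoting writes with a two-character escape.
const SHORT_ESCAPES = new Map([
  [0x08, "\\b"], [0x09, "\\t"], [0x0a, "\\n"], [0x0c, "\\f"],
  [0x0d, "\\r"], [0x22, '\\"'], [0x5c, "\\\\"],
]);

// Sketch of string quoting with the proposed surrogate branches added.
function quoteJSONStringWellFormed(value) {
  let product = '"';
  for (let i = 0; i < value.length; i++) {
    const c = value.charCodeAt(i);
    const short = SHORT_ESCAPES.get(c);
    if (short !== undefined) {
      product += short;
    } else if (c < 0x20) {
      product += unicodeEscape(c); // other control characters
    } else if (isLowSurrogate(c)) {
      // A low surrogate reached here was not consumed as part of a pair.
      product += unicodeEscape(c);
    } else if (isHighSurrogate(c)) {
      const n = i + 1 < value.length ? value.charCodeAt(i + 1) : -1;
      if (isLowSurrogate(n)) {
        product += value[i] + value[i + 1]; // well-formed pair, copied as-is
        i++; // progress iteration past N
      } else {
        product += unicodeEscape(c); // unpaired high surrogate
      }
    } else {
      product += value[i]; // the existing final "Else"
    }
  }
  return product + '"';
}

console.log(quoteJSONStringWellFormed("\u{10000}") === '"\u{10000}"'); // true
console.log(quoteJSONStringWellFormed("\ud800")); // "\ud800" (six-character escape)
```

Because unpaired surrogates leave this function as ASCII escapes, the output is always well-formed UTF-16, and JSON.parse still recovers the original code units from it.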
Note that this change affects only strings that are not well-formed UTF-16 code unit sequences (and that will most likely fail to translate to other UTFs), and its effect on their encoding is minimal. Any well-formed UTF-16 code unit sequence (a sequence where every member is either a high surrogate code unit followed by a low surrogate code unit or is a non-surrogate code unit) is encoded in the same way as is currently specified.
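The definition of well-formedness used in that note can be checked mechanically. Below is a minimal sketch; the function name `isWellFormedUTF16` is hypothetical, and the surrogate ranges are Unicode's (high surrogates 0xD800 to 0xDBFF, low surrogates 0xDC00 to 0xDFFF).

```javascript
// Returns true when every code unit is either a non-surrogate or part of
// a high surrogate immediately followed by a low surrogate.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c < 0xdc00) {
      // High surrogate: the next code unit must be a low surrogate.
      const n = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
      if (n < 0xdc00 || n >= 0xe000) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (c >= 0xdc00 && c < 0xe000) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("\u{10000}")); // true
console.log("\u{10000}".split("").map(isWellFormedUTF16)); // [ false, false ]
```

Strings for which this returns true are the ones whose encoding the proposal leaves unchanged.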