JSON.stringify produces invalid UTF-16 #944

@Maxdamantus

JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value". However, ECMAScript strings can contain code unit sequences that are not well-formed UTF-16, and QuoteJSONString does not account for this, so any UTF-16 ill-formedness in the encoded value is carried over to the output.

The effect of this is that the JSON string cannot be correctly converted to another UTF such as UTF-8, so the conversion might fail or involve inserting replacement characters.

Example of the resulting strange behaviour when invoking js (SpiderMonkey) or node:

$ js -e 'print(JSON.stringify("\u{10000}".split("")));' | js -e 'print(JSON.parse(readline()).join("") === "\u{10000}");'
false

$ node -e 'console.log(JSON.stringify("\u{10000}".split("")));' | node -e 'require("readline").createInterface({ input: process.stdin, terminal: false }).on("line", l => { console.log(JSON.parse(l).join("") === "\u{10000}"); });'
false
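
The same failure can be sketched in-process, without a shell pipe. This is only an illustration under stated assumptions: the JSON text is built by hand so that the surrogates stay unescaped (mimicking the behaviour described above, since an engine that already escapes lone surrogates would not reproduce it), and `TextEncoder`/`TextDecoder` stand in for the UTF-8 pipe between the two processes.

```javascript
// Split U+10000 into its two surrogate code units, as in the shell examples.
const halves = "\u{10000}".split(""); // ["\uD800", "\uDC00"]

// Assumption: hand-built JSON text with the surrogates left unescaped,
// mimicking a serializer that copies ill-formed UTF-16 straight through.
const rawJson = '["' + halves.join('","') + '"]';

// Simulate the UTF-8 pipe: each lone surrogate is replaced by U+FFFD.
const utf8 = new TextEncoder().encode(rawJson);
const decoded = new TextDecoder().decode(utf8);

const result = JSON.parse(decoded).join("");
console.log(result === "\u{10000}");    // false
console.log(result === "\uFFFD\uFFFD"); // true
```

Each surrogate is separated from its partner by the `","` delimiter, so neither can be paired during transcoding and both are lost.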

I propose something similar to the fragment below (not currently written in formal ECMAScript style) to be added to the specification for QuoteJSONString, before the final "Else" that currently exists.

else if C is a low surrogate (C >= 0xDC00 && C < 0xE000),
	concatenate the \uHHHH notation of C to product
else if C is a high surrogate (C >= 0xD800 && C < 0xDC00),
	if it is the last code unit,
		concatenate the \uHHHH notation of C to product
	else,
		let N be the next code unit
		if N is a low surrogate (N >= 0xDC00 && N < 0xE000),
			concatenate C to product
			concatenate N to product
			progress iteration to the next code unit (N)
		else,
			concatenate the \uHHHH notation of C to product
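
As a rough, non-normative sketch, the fragment above could look like this in JavaScript. The names `quoteJSONStringProposed`, `unicodeEscape`, `isHighSurrogate` and `isLowSurrogate` are hypothetical helpers introduced only for illustration. The existing QuoteJSONString rules for non-surrogate code units are approximated by delegating to `JSON.stringify` on a single code unit.

```javascript
// Hypothetical helper: the \uHHHH notation of a single code unit.
const unicodeEscape = (unit) => "\\u" + unit.toString(16).padStart(4, "0");

// High surrogates are 0xD800..0xDBFF, low surrogates are 0xDC00..0xDFFF.
const isHighSurrogate = (u) => u >= 0xD800 && u < 0xDC00;
const isLowSurrogate = (u) => u >= 0xDC00 && u < 0xE000;

// Sketch of QuoteJSONString with the proposed surrogate handling.
function quoteJSONStringProposed(value) {
  let product = '"';
  for (let i = 0; i < value.length; i++) {
    const c = value.charCodeAt(i);
    if (isLowSurrogate(c)) {
      // A low surrogate reached here was not consumed as part of a pair.
      product += unicodeEscape(c);
    } else if (isHighSurrogate(c)) {
      // -1 marks "no next code unit" (C is the last code unit).
      const n = i + 1 < value.length ? value.charCodeAt(i + 1) : -1;
      if (isLowSurrogate(n)) {
        // Well-formed pair: copy both code units and skip over N.
        product += value[i] + value[i + 1];
        i++;
      } else {
        product += unicodeEscape(c);
      }
    } else {
      // Assumption: reuse the engine's existing escaping for everything else.
      product += JSON.stringify(value[i]).slice(1, -1);
    }
  }
  return product + '"';
}

console.log(quoteJSONStringProposed("\uD800"));          // "\ud800"
console.log(quoteJSONStringProposed("a\uDC00\uD800b"));  // "a\udc00\ud800b"
console.log(JSON.parse(quoteJSONStringProposed("a\uD800b")) === "a\uD800b"); // true
```

Because lone surrogates are emitted as ASCII escapes, the output is well-formed UTF-16 and survives conversion to UTF-8, while `JSON.parse` still recovers the original code units.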

Note that this change would only affect the encoding of strings that are not well-formed UTF-16 code unit sequences (which will most likely fail to translate to other UTFs anyway). Any well-formed UTF-16 code unit sequence (a sequence where every member is either a high surrogate code unit followed by a low surrogate code unit or is a non-surrogate code unit) is encoded in the same way as is currently specified.
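
The definition of well-formedness used above can be written as a small check. This is a minimal sketch; `isWellFormedUTF16` is a hypothetical name, not something taken from the specification.

```javascript
// Returns true if every code unit is either a non-surrogate, or a high
// surrogate immediately followed by a low surrogate.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c < 0xDC00) {
      // High surrogate: the next code unit must be a low surrogate.
      const n = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
      if (n < 0xDC00 || n >= 0xE000) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (c >= 0xDC00 && c < 0xE000) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("\u{10000}"));              // true
console.log(isWellFormedUTF16("\u{10000}".split("")[0])); // false
console.log(isWellFormedUTF16("\uDC00\uD800"));           // false
```

Only strings for which this check returns false would be encoded differently under the proposed change.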

Labels: proposal (This is related to a specific proposal, and will be closed/merged when the proposal reaches stage 4.)