surrogate Unicode code points in input stream #19

Closed
@jmdyck

In tokenizer/unicodeCharsProblematic.test, the first 4 tests involve input streams containing U+DFFF or U+D800. In each case, the expected output includes:
(1) a parse error, and
(2) a character token in which the code point in question has been replaced by U+FFFD.

Looking at various versions of the HTML5 spec, I can't find a justification for either the parse error or the replacement.

The closest I found was the paragraph:

Otherwise, if the number is in the range 0xD800 to 0xDFFF
or is greater than 0x10FFFF, then this is a parse error.
Return a U+FFFD REPLACEMENT CHARACTER character token.

This would appear to be precisely the justification, except that it occurs in the "Tokenizing character references" section, and "the number" is the interpreted value of the numeral in a character reference. (E.g., the paragraph applies to an input stream containing the 8 characters & # x D 8 0 0 ; , not to an input stream containing the single code point U+D800.)
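
To make the distinction concrete, here is a minimal Python sketch. It models only the quoted paragraph, not the full character-reference algorithm in the spec, and the names `resolve_numeric_reference`, `reference_input`, and `raw_input` are hypothetical, chosen for illustration rather than taken from the spec or from the test suite.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def resolve_numeric_reference(number: int) -> Tuple[str, bool]:
    """Apply only the quoted rule to the numeric value of a character reference.

    Returns (character, is_parse_error). Other cases handled by the real
    spec text are deliberately left out of this sketch.
    """
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return REPLACEMENT, True
    return chr(number), False


# Input stream containing a character reference: eight characters.
reference_input = "&#xD800;"
# Input stream containing the surrogate code point itself: one code point.
raw_input = "\ud800"

assert len(reference_input) == 8
assert len(raw_input) == 1

# The quoted rule operates on the number parsed out of the reference...
number = int(reference_input[3:-1], 16)
assert resolve_numeric_reference(number) == (REPLACEMENT, True)

# ...whereas raw_input contains no reference, so under this reading the
# quoted rule is never reached for it.
assert "&" not in raw_input
```

The sketch only shows that the two inputs are different strings and that the quoted rule is keyed on the parsed number; it says nothing about what a tokenizer should do with the raw surrogate, which is the open question here.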
