surrogate Unicode code points in input stream

In tokenizer/unicodeCharsProblematic.test, the first four tests use input streams containing U+DFFF or U+D800. In each case, the expected output includes:
(1) a parse error, and
(2) a character token in which the code point in question has been replaced by U+FFFD.

Looking at various versions of the HTML5 spec, I can't find a justification for either the parse error or the replacement.

The closest I found was the paragraph:

> Otherwise, if the number is in the range 0xD800 to 0xDFFF
> or is greater than 0x10FFFF, then this is a parse error.
> Return a U+FFFD REPLACEMENT CHARACTER character token.

This would appear to be precisely the justification, except that it occurs in the "Tokenizing character references" section, where "the number" is the interpreted value of the numeral in a character reference. For example, the paragraph applies to an input stream containing the 8 characters & # x D 8 0 0 ; but not to an input stream containing the single code point U+D800.
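
To make the distinction concrete, here is a minimal Python sketch of the quoted rule only. The names `resolve_numeric_reference`, `reference_text`, and `raw_stream` are hypothetical and not taken from the spec or from any tokenizer; the slicing used to extract the numeral is a simplification, not a real character reference parser.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def resolve_numeric_reference(number: int) -> Tuple[str, bool]:
    """Return (character, is_parse_error) for a character reference's numeric value.

    Models only the quoted paragraph; every other case in the spec is ignored.
    """
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return REPLACEMENT, True
    return chr(number), False


# Case the quoted paragraph covers: eight characters spelling a reference.
reference_text = "&#xD800;"
number = int(reference_text[3:-1], 16)  # simplified: strip "&#x" and ";"
char, is_parse_error = resolve_numeric_reference(number)

# Case the tests exercise: a single surrogate code point in the stream itself.
# There is no "&", so the character reference rule is never reached.
raw_stream = "\ud800"
contains_reference = "&" in raw_stream
```

Under these assumptions the rule yields U+FFFD and a parse error for `reference_text`, while `raw_stream` never reaches that rule, which is the gap described above.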

