Description
In tokenizer/unicodeCharsProblematic.test, the first 4 tests involve input streams containing U+DFFF or U+D800. In each case, the expected output includes:
(1) a parse error, and
(2) a character token in which the code point in question has been replaced by U+FFFD.
Looking at various versions of the HTML5 spec, I can't find a justification for either the parse error or the replacement.
The closest I found was this paragraph:
Otherwise, if the number is in the range 0xD800 to 0xDFFF
or is greater than 0x10FFFF, then this is a parse error.
Return a U+FFFD REPLACEMENT CHARACTER character token.
This would appear to be precisely the justification, except that it occurs in the "Tokenizing character references" section, where "the number" is the interpreted value of the numeral in a character reference. (E.g., the paragraph applies to an input stream containing the 8 characters "&#xD800;", not to an input stream containing the single code point U+D800.)
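
To make the distinction concrete, here is a minimal Python sketch of the quoted paragraph only. The function name resolve_numeric_reference and its return shape are hypothetical (they are not taken from the spec or from any tokenizer implementation), and the other branches of the character reference algorithm are deliberately left out.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def resolve_numeric_reference(number):
    """Apply only the quoted rule to the numeric value of a character reference.

    Hypothetical helper: returns (character, is_parse_error). Other cases
    handled by the spec's character reference algorithm are not modeled here.
    """
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        # Surrogate range or beyond the Unicode range: parse error, U+FFFD.
        return REPLACEMENT_CHARACTER, True
    return chr(number), False


# Input the quoted paragraph covers: eight characters spelling a reference.
reference_input = "&#xD800;"

# Input the tests actually use: one code point, a lone surrogate.
raw_input = "\ud800"
```

Here reference_input has length 8 and raw_input has length 1, so the rule as quoted is reached only for the first; nothing in this sketch says what a tokenizer should do with the second.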