UTF8Encoding should support encoding/decoding of unpaired surrogates #14785
Description
According to RFC 3629 encoding/decoding unmatched surrogates should be disallowed:
"The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters."
However, this hasn't been followed by real world encoders/decoders. For example, the ECMA-335 standard encodes string arguments of custom attributes using UTF8 and the compilers allowed unpaired surrogates in the attribute argument. Another example is PDB - the file paths in PDB are stored as UTF8 encoded strings and unpaired surrogates are also allowed. The same for values of local string constants (e.g. const string surrogate = "\ud800"
).
To avoid breaking changes Roslyn needs to allow unpaired surrogates in the above cases and the MetadataReader should also use a variant of UTF8 encoding that is able to decode them. Currently Roslyn has a custom implementation of UTF8 encoder originating from CCI. In general, it seems that pragmatically a UTF16-UTF8 round-tripping is desirable in certain scenarios and UTF8Encoding should support it.
I propose to add a constructor to UTF8 Encoding that takes a bool allowUnpairedSurrogates (false by default) that can be used by both Roslyn and MetadataReader.
Activity