UTF8Encoding should support encoding/decoding of unpaired surrogates

According to RFC 3629, encoding or decoding unpaired surrogates is disallowed:

"The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters."

However, real-world encoders and decoders have not followed this. For example, the ECMA-335 standard encodes string arguments of custom attributes as UTF-8, and the compilers have allowed unpaired surrogates in attribute arguments. Another example is PDB: file paths in PDBs are stored as UTF-8 encoded strings, and unpaired surrogates are allowed there too. The same holds for the values of local string constants (e.g. `const string surrogate = "\ud800"`).

To avoid breaking changes, Roslyn needs to allow unpaired surrogates in the above cases, and the MetadataReader should also use a variant of UTF-8 encoding that can decode them. Currently Roslyn has a custom UTF-8 encoder implementation originating from CCI. More generally, UTF-16 to UTF-8 round-tripping is pragmatically desirable in certain scenarios, and UTF8Encoding should support it.

I propose adding a constructor to `UTF8Encoding` that takes a `bool allowUnpairedSurrogates` parameter (false by default), which both Roslyn and MetadataReader could use.
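For comparison, here is a rough sketch of the same strict versus opt-in split in Python rather than .NET. It is only an analogy chosen because it is easy to run, not the proposed `UTF8Encoding` API: Python's default UTF-8 codec rejects a lone surrogate in line with RFC 3629, while its `surrogatepass` error handler encodes the code point with the ordinary three-byte pattern so that it round-trips.

```python
# A lone high surrogate, analogous to the C# literal "\ud800".
lone = "\ud800"

# The strict UTF-8 codec follows RFC 3629 and refuses to encode it.
try:
    lone.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# The opt-in "surrogatepass" handler applies the generic 3-byte UTF-8
# pattern to the surrogate code point instead of raising an error.
encoded = lone.encode("utf-8", errors="surrogatepass")

# Decoding with the same handler restores the original string,
# which is the lossless round trip described above.
decoded = encoded.decode("utf-8", errors="surrogatepass")

print(strict_rejects, encoded.hex(), decoded == lone)
```

The opt-in handler plays roughly the role that an `allowUnpairedSurrogates` flag would play: strict behavior stays the default, and callers that need lossless round-tripping ask for it explicitly.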


UTF8Encoding should support encoding/decoding of unpaired surrogates #14785
