Skip to content

UTF8Encoding should support encoding/decoding of unpaired surrogates #14785

Open
@tmat

Description

According to RFC 3629 encoding/decoding unmatched surrogates should be disallowed:

"The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters."

However, this hasn't been followed by real world encoders/decoders. For example, the ECMA-335 standard encodes string arguments of custom attributes using UTF8 and the compilers allowed unpaired surrogates in the attribute argument. Another example is PDB - the file paths in PDB are stored as UTF8 encoded strings and unpaired surrogates are also allowed. The same for values of local string constants (e.g. const string surrogate = "\ud800").

To avoid breaking changes Roslyn needs to allow unpaired surrogates in the above cases and the MetadataReader should also use a variant of UTF8 encoding that is able to decode them. Currently Roslyn has a custom implementation of UTF8 encoder originating from CCI. In general, it seems that pragmatically a UTF16-UTF8 round-tripping is desirable in certain scenarios and UTF8Encoding should support it.

I propose to add a constructor to UTF8 Encoding that takes a bool allowUnpairedSurrogates (false by default) that can be used by both Roslyn and MetadataReader.

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    api-needs-workAPI needs work before it is approved, it is NOT ready for implementationarea-System.Text.Encodinghelp wanted[up-for-grabs] Good issue for external contributors

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions