Description
Background and motivation
I recently needed to check that both UTF-8 and UTF-16 strings are valid (well-formed). I found Utf8.IsValid for UTF-8, but there doesn't seem to be an equivalent API for UTF-16. The proposed API would check that all surrogates are paired up properly.
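For reference, here is a small sketch of how the existing UTF-8 API behaves. It assumes .NET 8 or later (where Utf8.IsValid is available); the byte values are illustrative inputs chosen for this example, not taken from the runtime.

```csharp
using System;
using System.Text.Unicode;

// "hé" encoded as UTF-8: 'h' is one byte, 'é' is the two-byte sequence C3 A9.
byte[] wellFormed = { 0x68, 0xC3, 0xA9 };

// The lead byte of a two-byte sequence with its continuation byte missing.
byte[] truncated = { 0xC3 };

Console.WriteLine(Utf8.IsValid(wellFormed)); // True
Console.WriteLine(Utf8.IsValid(truncated));  // False
```

A UTF-16 counterpart would presumably answer the same yes/no question for a ReadOnlySpan<char>, with unpaired surrogates playing the role of the truncated sequence.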
Additionally, as @MihaZupan pointed out to me, the runtime has at least 3 places that do something similar (the first is the same check but inverted, and the other 2 are slightly different):
runtime/src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/StringSearchValues.cs
Line 483 in 167d862
private static bool ContainsIncompleteSurrogatePairs(ReadOnlySpan<string> values)

// Replace any invalid surrogate chars.

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/Normalization.Icu.cs
Line 221 in 167d862
private static bool HasInvalidUnicodeSequence(ReadOnlySpan<char> s)
API Proposal
namespace System.Text.Unicode;

public static class Utf16
{
    public static bool IsValid(ReadOnlySpan<char> value) => ...;
}
(This design matches our Utf8 class, but with only the one method.)
OR
namespace System;

public sealed class String
{
    public bool IsValid() => ...;
    public static bool IsValid(ReadOnlySpan<char> value) => ...;
}
(This design matches what we've done with GetHashCode and similar APIs.)
API Usage
var isValid = Utf16.IsValid(mySpan);
if (!isValid) ... // complain to the caller, or return false (depending on which API it is)
Alternative Designs
API review should decide whether we call it IsWellFormed or IsValid.
Make me write this myself:
while (true)
{
    var idx = value.IndexOfAnyInRange('\uD800', '\uDFFF');
    if (idx < 0) return true;
surrogate:
    if ((uint)value.Length < (uint)idx + 2) return false;
    if (!char.IsHighSurrogate(value[idx]) || !char.IsLowSurrogate(value[idx + 1])) return false;
    value = value[(idx + 2)..];
    idx = 0;
    if (value.Length > 0 && char.IsSurrogate(value[0])) goto surrogate;
}
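For comparison, the same check can be sketched as a plain scalar loop, which is easier to read but gives up the vectorized IndexOfAnyInRange search used above. The Utf16Sketch class name is hypothetical and exists only for this illustration; it is not the proposed API or an existing .NET type.

```csharp
using System;

// Hypothetical helper for illustration only; not part of .NET.
public static class Utf16Sketch
{
    // Returns true when every surrogate in 'value' belongs to a
    // high surrogate immediately followed by a low surrogate.
    public static bool IsValid(ReadOnlySpan<char> value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (!char.IsSurrogate(c))
            {
                continue; // ordinary BMP character, nothing to check
            }

            // A lone low surrogate, a high surrogate at the end of the input,
            // or a high surrogate not followed by a low surrogate is invalid.
            if (!char.IsHighSurrogate(c)
                || i + 1 >= value.Length
                || !char.IsLowSurrogate(value[i + 1]))
            {
                return false;
            }

            i++; // skip the low surrogate that completes the pair
        }

        return true;
    }
}
```

Under this sketch, "abc" and "\uD83D\uDE00" (a complete pair) are valid, while "\uD83D" (lone high surrogate) and "\uDE00\uD83D" (reversed pair) are not.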
Risks
None.