[API Proposal]: UTF-16 version of Utf8.IsValid #118018

@hamarb123

Background and motivation

I recently needed to check that both UTF-8 and UTF-16 strings are valid (well-formed). I found Utf8.IsValid for UTF-8, but there doesn't seem to be an equivalent API for UTF-16. The proposed API would check that all surrogates are paired correctly: every high surrogate is immediately followed by a low surrogate, and no low surrogate appears on its own.
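
To make the intended semantics concrete, here is a minimal scalar sketch of the check. The type and method names (`Utf16Sketch`, `IsWellFormed`) are hypothetical stand-ins for illustration only; they are not the proposed API shape and not how the runtime would necessarily implement it.

```csharp
using System;

// Hypothetical stand-in for the proposed API; not part of the BCL.
public static class Utf16Sketch
{
    // Returns true when every high surrogate is immediately followed by a
    // low surrogate and no low surrogate appears on its own.
    public static bool IsWellFormed(ReadOnlySpan<char> value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (!char.IsSurrogate(c))
                continue; // ordinary BMP code unit, nothing to check

            // Any surrogate we land on must be a high surrogate with a
            // low surrogate directly after it.
            if (!char.IsHighSurrogate(c)
                || i + 1 >= value.Length
                || !char.IsLowSurrogate(value[i + 1]))
            {
                return false;
            }

            i++; // skip the low surrogate that completes the pair
        }

        return true;
    }
}
```

Under this sketch, "a\uD83D\uDE00b" (a complete pair between two letters) and the empty string are accepted, while a lone "\uD83D" or a reversed pair such as "\uDE00\uD83D" is rejected.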

Additionally, as @MihaZupan pointed out to me, the runtime has at least 3 places that do something similar (the first is the same check but inverted, and the other 2 are slightly different):

API Proposal

namespace System.Text.Unicode;

public static class Utf16
{
    public static bool IsValid(ReadOnlySpan<char> value) => ...;
}

(This design matches our Utf8 class, but with only the one method.)

OR

namespace System;

public sealed class String
{
    public bool IsValid() => ...;
    public static bool IsValid(ReadOnlySpan<char> value) => ...;
}

(This design matches what we've done with GetHashCode & similar APIs)

API Usage

var isValid = Utf16.IsValid(mySpan);
if (!isValid) ... // complain to the caller, or return false (depending on which API it is)

Alternative Designs

API review should decide whether we call it IsWellFormed or IsValid.

Make me write this myself:

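while (true)
{
	// Jump to the next surrogate code unit; if there is none, the rest is well-formed.
	var idx = value.IndexOfAnyInRange('\uD800', '\uDFFF');
	if (idx < 0) return true;
	surrogate:
	// A surrogate needs a partner, so at least 2 code units must remain from idx.
	if ((uint)value.Length < (uint)idx + 2) return false;
	// The pair must be a high surrogate followed by a low surrogate.
	if (!char.IsHighSurrogate(value[idx]) || !char.IsLowSurrogate(value[idx + 1])) return false;
	value = value[(idx + 2)..];
	idx = 0;
	// If another surrogate follows immediately, validate it directly instead of searching again.
	if (value.Length > 0 && char.IsSurrogate(value[0])) goto surrogate;
}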

Risks

None.
