Description
Proposal
.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via StringComparer.OrdinalIgnoreCase). The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.
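As a rough mental model of the behavior just described (not the runtime's actual code, which may well avoid allocating intermediate strings), the comparison is equivalent to something like the following. The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;

static class OrdinalIgnoreCaseModel
{
    // Conceptual model only: uppercase both inputs with the invariant
    // culture, then compare the results code unit by code unit.
    public static bool EqualsViaUpperInvariant(string a, string b)
    {
        string upperA = a.ToUpperInvariant();
        string upperB = b.ToUpperInvariant();
        return string.Equals(upperA, upperB, StringComparison.Ordinal);
    }
}
```

For ASCII input such as "Hello" and "HELLO" this returns true on every platform; the platform-dependent differences discussed in this proposal only arise for non-ASCII characters.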
.NET Core also provides methods to convert chars, Runes, and strings to uppercase or lowercase using the "invariant" culture (ToUpperInvariant / ToLowerInvariant). The current implementation of this API is to p/invoke NLS on Windows or ICU on non-Windows.
I propose changing the logic so that .NET Core carries its own copy of the ICU "simple" casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an OrdinalIgnoreCase comparison, and it would affect string casing only when using CultureInfo.InvariantCulture.
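A minimal sketch of what a table-driven approach could look like is shown below. Everything in it is an assumption made for illustration: the class, the method names, and the two-entry dictionary standing in for the full simple case mapping data. A real implementation would carry the complete table in a compact form and would not replace ill-formed surrogates the way EnumerateRunes does.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SimpleCasingSketch
{
    // Hypothetical two-entry excerpt of a simple (one code point to one
    // code point) uppercase table. The real data set is far larger.
    private static readonly Dictionary<int, int> s_simpleUpper = new Dictionary<int, int>
    {
        [0x00E9] = 0x00C9, // LATIN SMALL LETTER E WITH ACUTE -> capital
        [0xA74D] = 0xA74C, // LATIN SMALL LETTER O WITH LOOP -> capital
    };

    public static int ToUpperSimple(int codePoint)
    {
        // ASCII fast path: no table lookup needed.
        if (codePoint >= 'a' && codePoint <= 'z')
            return codePoint - 0x20;

        // Code points without a table entry map to themselves.
        return s_simpleUpper.TryGetValue(codePoint, out int upper) ? upper : codePoint;
    }

    public static bool EqualsIgnoreCase(string a, string b)
    {
        StringRuneEnumerator ea = a.EnumerateRunes();
        StringRuneEnumerator eb = b.EnumerateRunes();

        while (true)
        {
            bool hasA = ea.MoveNext();
            bool hasB = eb.MoveNext();

            if (hasA != hasB) return false; // different number of code points
            if (!hasA) return true;         // both exhausted, all pairs matched

            if (ToUpperSimple(ea.Current.Value) != ToUpperSimple(eb.Current.Value))
                return false;
        }
    }
}
```

Because the lookup never leaves managed code, the result depends only on the table shipped with the product, not on the operating system.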
Justification
Today, when processing UTF-8 data in a case-insensitive manner (such as via Equals(..., OrdinalIgnoreCase)), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. This transcoding and p/invoke adds unnecessary overhead. With this proposal, we'd be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also be applicable to existing UTF-16 APIs such as string.ToUpperInvariant and string.ToLowerInvariant since we'd be able to optimize those calls.
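To make the overhead concrete, here is a sketch of the transcode-then-compare pattern that UTF-8 callers effectively go through today. The helper name is hypothetical; the calls it makes (Encoding.UTF8.GetString and string.Equals) are existing public APIs.

```csharp
using System;
using System.Text;

static class Utf8CompareSketch
{
    // Hypothetical helper: both inputs are transcoded to UTF-16 strings
    // (two allocations) before the case-insensitive comparison can run.
    public static bool EqualsIgnoreCaseViaTranscoding(ReadOnlySpan<byte> utf8A, ReadOnlySpan<byte> utf8B)
    {
        string a = Encoding.UTF8.GetString(utf8A);
        string b = Encoding.UTF8.GetString(utf8B);
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }
}
```

With a local casing table, a comparison like this could in principle decode and fold one code point at a time directly from the UTF-8 bytes, without the intermediate strings.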
As mentioned earlier, today's casing tables involve p/invoking NLS or ICU, depending on platform. This means that comparison / invariant casing APIs could provide different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)
Here are some concrete examples demonstrating the problems:
// 'ß' is U+00DF LATIN SMALL LETTER SHARP S
// 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S
string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes
string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"
bool areEqual = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase); // returns "False" on Windows, otherwise "True"
With this proposal, the code above will behave the same across all OSes, following what is today's non-Windows behavior. The results would be locked to whatever version of the Unicode data we include in the product as part of the CharUnicodeInfo class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within the CharUnicodeInfo class follows the Unicode Standard 11.0.0.
Breaking change discussion
Affected APIs:
- string / char / Rune equality methods or hash code generation routines which take StringComparison.OrdinalIgnoreCase as a parameter. All other comparisons are unchanged.
- string / char / Rune case changing methods when CultureInfo.InvariantCulture is provided. All other cultures are unchanged.
- Extension methods on ReadOnlySpan<char> which provide equivalent functionality to the above.
- StringComparer.OrdinalIgnoreCase. All other StringComparer instances are unchanged.
- Case changing methods on CultureInfo.InvariantCulture.TextInfo.
If GlobalizationMode.Invariant is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.
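To illustrate what "non-ASCII characters remain unchanged" means in practice, here is a sketch of ASCII-only uppercasing. It is a hypothetical helper written for this example and is not the runtime's actual code path.

```csharp
using System;

static class AsciiOnlyCasingSketch
{
    // Uppercases 'a'..'z' and leaves every other UTF-16 code unit as-is.
    public static string ToUpperAsciiOnly(string s)
    {
        char[] buffer = s.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            char c = buffer[i];
            if (c >= 'a' && c <= 'z')
                buffer[i] = (char)(c - 0x20);
        }
        return new string(buffer);
    }
}
```

Under this model, "caf\u00E9" becomes "CAF\u00E9": the ASCII letters change case while the non-ASCII 'é' is passed through untouched.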
Applications which depend on OrdinalIgnoreCase equality being stable may be affected by this proposed change. That is, if an application relies on "ß" and "ẞ" being not equal under an OrdinalIgnoreCase comparer, that application is likely to experience odd behavior in the future.
In general, applications cannot rely on such behavior anyway, because as previously mentioned the operating system historically has updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned false might start returning true:
string a = "ꝍ"; // U+A74D
string b = "Ꝍ"; // U+A74C
// today, may be "True" or "False" depending on which Windows version the app is running on.
// with this proposal, always returns "True"
bool areEqual = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it's using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.
There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using OrdinalIgnoreCase might compare as not equal using InvariantCultureIgnoreCase. For example:
// with this proposal, returns "True" across all OSes
bool equalsOIC = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase);
// with this proposal, returns "False" on Windows, "True" otherwise
bool equalsICIC = "ß".Equals("ẞ", StringComparison.InvariantCultureIgnoreCase);
I don't expect this to trip up most applications because I don't believe it to be common for an application to compare a string pair using two different comparers, but it is worth pointing out as a curious edge case.
This may also lead to a discrepancy between managed code which uses StringComparison.OrdinalIgnoreCase and unmanaged code (including within the runtime) which uses CompareStringOrdinal on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.