Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms #30960

Open

Description

Proposal

.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via StringComparer.OrdinalIgnoreCase). The current implementation of this API is to call ToUpperInvariant on each string, then compare the resulting uppercase strings for bitwise equality.
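The behavior described above can be sketched roughly as follows. This is only an illustration of "uppercase both sides, then compare bitwise"; the class and method names are hypothetical and this is not the runtime's actual code.

```csharp
using System;

static class OrdinalIgnoreCaseSketch
{
    // Hypothetical helper: mimics the described behavior by uppercasing
    // both inputs with the invariant culture and comparing the results
    // code unit by code unit (an ordinal comparison).
    public static bool EqualsViaUpperInvariant(string a, string b)
    {
        if (a is null || b is null)
        {
            return ReferenceEquals(a, b);
        }

        string upperA = a.ToUpperInvariant();
        string upperB = b.ToUpperInvariant();
        return string.Equals(upperA, upperB, StringComparison.Ordinal);
    }
}
```

For ASCII input such as "Hello" and "HELLO" the result does not depend on the platform; the differences discussed in this issue only show up for certain non-ASCII characters.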

.NET Core also provides methods to convert chars, Runes, and strings to uppercase or lowercase using the "invariant" culture (ToUpperInvariant / ToLowerInvariant). The current implementation of these APIs is to p/invoke NLS on Windows or ICU on other platforms.

I propose changing the logic so that .NET Core carries its own copy of the ICU "simple" casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an OrdinalIgnoreCase comparison, and it would affect string casing only when using CultureInfo.InvariantCulture.
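As a minimal sketch of what "consulting our own copy of the table" could look like, assuming a simple one-to-one char mapping: the table below is a tiny hand-picked subset for illustration only, and the type and member names are hypothetical. A real table would cover the full set of Unicode simple case mappings and would likely use a more compact data structure.

```csharp
using System;
using System.Collections.Generic;

static class SimpleCasingSketch
{
    // Illustrative subset of simple (one-to-one) uppercase mappings.
    // Not the real data; just enough entries to demonstrate the lookup.
    private static readonly Dictionary<char, char> s_toUpper = new Dictionary<char, char>
    {
        ['\u00E9'] = '\u00C9', // e with acute
        ['\uA74D'] = '\uA74C', // LATIN SMALL LETTER OO -> LATIN CAPITAL LETTER OO
    };

    public static char ToUpperSimple(char c)
    {
        // Fast path for ASCII letters.
        if (c >= 'a' && c <= 'z')
        {
            return (char)(c - 0x20);
        }

        // Fall back to the local table; unmapped chars are returned unchanged.
        return s_toUpper.TryGetValue(c, out char upper) ? upper : c;
    }

    public static bool EqualsIgnoreCase(ReadOnlySpan<char> a, ReadOnlySpan<char> b)
    {
        if (a.Length != b.Length)
        {
            return false;
        }

        for (int i = 0; i < a.Length; i++)
        {
            if (ToUpperSimple(a[i]) != ToUpperSimple(b[i]))
            {
                return false;
            }
        }

        return true;
    }
}
```

Because the lookup never leaves managed code, the result is the same on every operating system for whatever data the table contains.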

Justification

Today, when processing UTF-8 data in a case-insensitive manner (such as via Equals(..., OrdinalIgnoreCase)), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. This transcoding and p/invoke adds unnecessary overhead. With this proposal, we'd be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also be applicable to existing UTF-16 APIs such as string.ToUpperInvariant and string.ToLowerInvariant since we'd be able to optimize those calls.
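To make the overhead concrete, here is a rough sketch of the "transcode first, then compare" pattern that callers with UTF-8 data effectively go through today. The helper name is hypothetical; the point is only that two string allocations and a transcoding pass happen before any comparison starts.

```csharp
using System;
using System.Text;

static class Utf8CompareSketch
{
    // Hypothetical helper: transcodes both UTF-8 inputs to UTF-16 strings
    // and only then performs the OrdinalIgnoreCase comparison.
    public static bool EqualsIgnoreCaseViaTranscoding(ReadOnlySpan<byte> utf8A, ReadOnlySpan<byte> utf8B)
    {
        string a = Encoding.UTF8.GetString(utf8A); // allocation + transcoding
        string b = Encoding.UTF8.GetString(utf8B); // allocation + transcoding
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }
}
```

With local casing tables, a comparison routine could in principle walk the UTF-8 bytes directly and skip both allocations.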

As mentioned earlier, today's casing tables involve p/invoking NLS or ICU, depending on platform. This means that comparison / invariant casing APIs could provide different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)

Here are some concrete examples demonstrating the problems:

// 'ß' is U+00DF LATIN SMALL LETTER SHARP S
// 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S

string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes
string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"
bool areEqual = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase); // returns "False" on Windows, otherwise "True"

With this proposal, the code above will behave the same across all OSes, following what is today's non-Windows behavior. The results would be locked to whatever version of the Unicode data we include in the product as part of the CharUnicodeInfo class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within the CharUnicodeInfo class follows the Unicode Standard 11.0.0.

Breaking change discussion

Affected APIs:

  • string / char / Rune equality methods or hash code generation routines which take StringComparison.OrdinalIgnoreCase as a parameter. All other comparisons are unchanged.
  • string / char / Rune case changing methods when CultureInfo.InvariantCulture is provided. All other cultures are unchanged.
  • Extension methods on ReadOnlySpan<char> which provide equivalent functionality to the above.
  • StringComparer.OrdinalIgnoreCase. All other StringComparer instances are unchanged.
  • Case changing methods on CultureInfo.InvariantCulture.TextInfo.

If GlobalizationMode.Invariant is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.
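For comparison, the invariant-globalization behavior described above amounts to ASCII-only casing. A minimal sketch, with a hypothetical helper name, might look like this:

```csharp
using System;

static class InvariantModeSketch
{
    // Hypothetical helper: uppercases only 'a'..'z' and leaves every
    // other char, including all non-ASCII chars, unchanged.
    public static string ToUpperAsciiOnly(string s)
    {
        char[] buffer = s.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            char c = buffer[i];
            if (c >= 'a' && c <= 'z')
            {
                buffer[i] = (char)(c - 0x20);
            }
        }

        return new string(buffer);
    }
}
```

Under this scheme "abc\u00E9" becomes "ABC\u00E9": the ASCII letters change and the accented letter is left alone.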

Applications which depend on OrdinalIgnoreCase equality being stable may be affected by this proposed change. That is, if an application relies on "ß" and "ẞ" being not equal under an OrdinalIgnoreCase comparer, that application is likely to experience odd behavior in the future.

In general, applications cannot rely on such behavior anyway, because as previously mentioned the operating system historically has updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned false might start returning true:

string a = "ꝍ"; // U+A74D
string b = "Ꝍ"; // U+A74C

// today, may be "True" or "False" depending on which Windows version the app is running on.
// with this proposal, always returns "True"
bool areEqual = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it's using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.

There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using OrdinalIgnoreCase might compare as not equal using InvariantCultureIgnoreCase. For example:

// with this proposal, returns "True" across all OSes
bool equalsOIC = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase);

// with this proposal, returns "False" on Windows, "True" otherwise
bool equalsICIC = "ß".Equals("ẞ", StringComparison.InvariantCultureIgnoreCase);

I don't expect this to trip up most applications because I don't believe it to be common for an application to compare a string pair using two different comparers, but it is worth pointing out as a curious edge case.

This may also lead to a discrepancy between managed code which uses StringComparison.OrdinalIgnoreCase and unmanaged code (including within the runtime) which uses CompareStringOrdinal on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.

Metadata

Labels

  • area-System.Runtime
  • breaking-change: Issue or PR that represents a breaking API or functional change over a prerelease.
  • design-discussion: Ongoing discussion about design without consensus
