Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms #30960

## Proposal

.NET Core provides APIs to compare strings for ordinal case-insensitive equality (such as via `StringComparer.OrdinalIgnoreCase`). The current implementation of this API is to call `ToUpperInvariant` on each string, then compare the resulting uppercase strings for bitwise equality.
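As a conceptual model only, the behavior described above can be sketched as follows. The class and method names are hypothetical, and the real implementation does not allocate intermediate strings like this; the sketch just shows the "uppercase both sides, then compare bits" idea.

```cs
using System;

static class OrdinalIgnoreCaseSketch
{
    // Hypothetical helper: uppercase both inputs with the invariant casing
    // rules, then compare the results code unit by code unit.
    public static bool ConceptualEquals(string a, string b)
    {
        string upperA = a.ToUpperInvariant();
        string upperB = b.ToUpperInvariant();
        return string.Equals(upperA, upperB, StringComparison.Ordinal);
    }
}
```

For ASCII input such as `"hello"` and `"HELLO"` this returns `true` everywhere; the platform differences discussed in this proposal only show up for non-ASCII characters.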

.NET Core also provides methods to convert `char`s, `Rune`s, and `string`s to uppercase or lowercase using the "invariant" culture (`ToUpperInvariant` / `ToLowerInvariant`). The current implementation of these APIs is to p/invoke [NLS](https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex) on Windows or [ICU](https://ssl.icu-project.org/apiref/icu4c/) on non-Windows.

I propose changing the logic so that .NET Core carries its own copy of the ICU "simple" casing tables, and we consult our copies of those tables on all operating systems. This would affect string comparison only when using an `OrdinalIgnoreCase` comparison, and it would affect string casing only when using `CultureInfo.InvariantCulture`.
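To make the idea of a locally carried "simple" casing table concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `SimpleCasingSketch` class, the dictionary layout, and the three-entry excerpt are hypothetical, and a real table would be generated from the Unicode data and stored far more compactly.

```cs
using System.Collections.Generic;
using System.Text;

static class SimpleCasingSketch
{
    // Hypothetical three-entry excerpt. A real table would cover every simple
    // (one code point to one code point) uppercase mapping in the Unicode data.
    private static readonly Dictionary<int, int> s_simpleUpper = new Dictionary<int, int>
    {
        { 0x00E9, 0x00C9 }, // é -> É
        { 0x03B1, 0x0391 }, // α -> Α
        { 0xA74D, 0xA74C }, // ꝍ -> Ꝍ
    };

    public static Rune ToUpperSimple(Rune value)
    {
        int codePoint = value.Value;

        // ASCII fast path: no table lookup needed.
        if (codePoint >= 'a' && codePoint <= 'z')
        {
            return new Rune(codePoint - 0x20);
        }

        // Code points without a table entry map to themselves.
        return s_simpleUpper.TryGetValue(codePoint, out int upper) ? new Rune(upper) : value;
    }

    // Two scalar values are equal ignoring case if their simple uppercase forms match.
    public static bool EqualsIgnoreCase(Rune a, Rune b) => ToUpperSimple(a) == ToUpperSimple(b);
}
```

Because the lookup is pure managed code over data shipped with the product, the result does not depend on the operating system or its version.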

## Justification

Today, when processing UTF-8 data in a case-insensitive manner (such as via `Equals(..., OrdinalIgnoreCase)`), we must first transcode the data to UTF-16 so that it can go through the normal p/invoke routines. The transcoding and the p/invoke both add unnecessary overhead. With this proposal, we'd be able to consult our own local copies of the casing table, which eliminates much of this overhead and streamlines the comparison process. This performance boost should also apply to existing UTF-16 APIs such as `string.ToUpperInvariant` and `string.ToLowerInvariant`, since we'd be able to optimize those calls.
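The shape of a transcoding-free UTF-8 comparison might look like the sketch below. It is an illustration under assumptions, not the proposed implementation: `Utf8IgnoreCaseSketch` is a hypothetical name, and `Rune.ToUpperInvariant` is used only as a stand-in for a lookup into a locally carried casing table (today that call still goes through the platform casing data).

```cs
using System;
using System.Buffers;
using System.Text;

static class Utf8IgnoreCaseSketch
{
    // Compares two UTF-8 buffers scalar value by scalar value,
    // without first building UTF-16 strings.
    public static bool EqualsIgnoreCase(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        while (!a.IsEmpty && !b.IsEmpty)
        {
            // This sketch treats any ill-formed UTF-8 sequence as a mismatch.
            if (Rune.DecodeFromUtf8(a, out Rune runeA, out int consumedA) != OperationStatus.Done ||
                Rune.DecodeFromUtf8(b, out Rune runeB, out int consumedB) != OperationStatus.Done)
            {
                return false;
            }

            // Stand-in for a lookup into our own casing table.
            if (Rune.ToUpperInvariant(runeA) != Rune.ToUpperInvariant(runeB))
            {
                return false;
            }

            a = a.Slice(consumedA);
            b = b.Slice(consumedB);
        }

        // Equal only if both inputs were fully consumed.
        return a.IsEmpty && b.IsEmpty;
    }
}
```

A caller would pass raw bytes, for example `Utf8IgnoreCaseSketch.EqualsIgnoreCase(Encoding.UTF8.GetBytes("Hello"), Encoding.UTF8.GetBytes("hELLO"))`.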

As mentioned earlier, today's casing operations p/invoke NLS or ICU, depending on platform. This means that the comparison and invariant casing APIs can produce different results on different operating systems. Even within the same operating system family, the casing tables can change based on OS version. (Windows 10 1703 has different casing tables than Windows 10 1903, for instance.)

Here are some concrete examples demonstrating the problems:

```cs
// 'ß' is U+00DF LATIN SMALL LETTER SHARP S
// 'ẞ' is U+1E9E LATIN CAPITAL LETTER SHARP S

string toUpper = "ß".ToUpperInvariant(); // returns "ß" on all OSes
string toLower = "ẞ".ToLowerInvariant(); // returns "ẞ" on Windows, otherwise "ß"
bool areEqual = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase); // returns "False" on Windows, otherwise "True"
```

With this proposal, the code above will behave the same across all OSes, following what is today's non-Windows behavior. The results would be locked to whatever version of the Unicode data we include in the product as part of the `CharUnicodeInfo` class. This data changes each release to reflect recent modifications to the Unicode Standard. As of this writing, the data contained within the `CharUnicodeInfo` class follows the Unicode Standard 11.0.0.

## Breaking change discussion

Affected APIs:

 * `string` / `char` / `Rune` equality methods or hash code generation routines which take `StringComparison.OrdinalIgnoreCase` as a parameter. All other comparisons are unchanged.
 * `string` / `char` / `Rune` case changing methods when `CultureInfo.InvariantCulture` is provided. All other cultures are unchanged.
 * Extension methods on `ReadOnlySpan<char>` which provide equivalent functionality to the above.
 * `StringComparer.OrdinalIgnoreCase`. All other `StringComparer` instances are unchanged.
 * Case changing methods on `CultureInfo.InvariantCulture.TextInfo`.
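For orientation, the following sketch exercises one representative call from several of the categories listed above. The `AffectedApisSketch` class and its `Probe` method are hypothetical names used only for illustration; the framework calls inside are existing APIs.

```cs
using System;
using System.Globalization;

static class AffectedApisSketch
{
    public static (bool Equal, bool SameHash, bool SpanEqual, string Upper) Probe(string a, string b)
    {
        // Equality method taking StringComparison.OrdinalIgnoreCase.
        bool equal = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

        // Hash code generation through StringComparer.OrdinalIgnoreCase.
        StringComparer comparer = StringComparer.OrdinalIgnoreCase;
        bool sameHash = comparer.GetHashCode(a) == comparer.GetHashCode(b);

        // Equivalent extension method on ReadOnlySpan<char>.
        bool spanEqual = a.AsSpan().Equals(b.AsSpan(), StringComparison.OrdinalIgnoreCase);

        // Case changing through CultureInfo.InvariantCulture.TextInfo.
        string upper = CultureInfo.InvariantCulture.TextInfo.ToUpper(a);

        return (equal, sameHash, spanEqual, upper);
    }
}
```

Calls that pass any other `StringComparison` value or any other culture are outside the scope of the change.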

If `GlobalizationMode.Invariant` is specified, the behavior will be the same as it is today, where non-ASCII characters remain unchanged.

Applications which depend on `OrdinalIgnoreCase` equality being stable may be affected by this proposed change. That is, if an application relies on `"ß"` and `"ẞ"` comparing as not equal under an `OrdinalIgnoreCase` comparer, that application is likely to see its behavior change once this proposal takes effect.

In general, applications _cannot_ rely on such behavior anyway because, as previously mentioned, the operating system has historically updated casing tables under the covers without the application getting a say. For example, after installing a new Windows version, a comparison which previously returned _false_ might start returning _true_:

```cs
string a = "ꝍ"; // U+A74D
string b = "Ꝍ"; // U+A74C

// today, may be "True" or "False" depending on which Windows version the app is running on.
// with this proposal, always returns "True"
bool areEqual = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
```

Furthermore, the string equality and case mapping information might be different between a web frontend application and the database it's using for backend storage. So performing such checks at the application level was never 100% reliable to begin with.

There is a potential oddity with this proposal: depending on operating system, two strings which compare as equal using `OrdinalIgnoreCase` might compare as not equal using `InvariantCultureIgnoreCase`. For example:

```cs
// with this proposal, returns "True" across all OSes
bool equalsOIC = "ß".Equals("ẞ", StringComparison.OrdinalIgnoreCase);

// with this proposal, returns "False" on Windows, "True" otherwise
bool equalsICIC = "ß".Equals("ẞ", StringComparison.InvariantCultureIgnoreCase);
```

I don't expect this to trip up most applications because I don't believe it to be common for an application to compare a string pair using two _different_ comparers, but it is worth pointing out as a curious edge case.

This may also lead to a discrepancy between managed code which uses `StringComparison.OrdinalIgnoreCase` and unmanaged code (including within the runtime) which uses [`CompareStringOrdinal`](https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringordinal) on Windows. I cannot think offhand of any components which do this, but we need to be mindful that such a discrepancy might occur.
