Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement collation native functions functions #86895

Merged
merged 27 commits into from
Jun 21, 2023
Merged
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
164bb0a
Implemented IndexOf and LastIndexOf functions
mkhamoyan May 30, 2023
1e4e9a5
Updated test cases
mkhamoyan May 31, 2023
292b915
Remove not needed parts
mkhamoyan May 31, 2023
9d23f7b
Implemented IsPrefix, IsSuffix functions
mkhamoyan Jun 1, 2023
1f8b5d5
Remove logs
mkhamoyan Jun 1, 2023
7b91b6c
Fix CI build failures
mkhamoyan Jun 2, 2023
c4a2d3c
Refactored
mkhamoyan Jun 2, 2023
001c366
Fixed some test cases for OSX
mkhamoyan Jun 6, 2023
0ec8d79
Minor changes in test cases
mkhamoyan Jun 6, 2023
fa7322c
test case minor refactoring
mkhamoyan Jun 6, 2023
fc2837b
Merge branch 'main' into hybrid_collation_functions
mkhamoyan Jun 6, 2023
a7ab572
Merge branch 'main' into hybrid_collation_functions
mkhamoyan Jun 7, 2023
36de1a9
Merge branch 'main' into hybrid_collation_functions
mkhamoyan Jun 14, 2023
139a6ba
Changed IndexOf functions implementation
mkhamoyan Jun 15, 2023
f3da1e5
Fix build failue
mkhamoyan Jun 15, 2023
5c3f172
Minor fixes
mkhamoyan Jun 15, 2023
ab317f7
Minor fix
mkhamoyan Jun 15, 2023
24094fe
Refactor as per review comments
mkhamoyan Jun 16, 2023
c425d38
Refactored Indexing functions calls
mkhamoyan Jun 16, 2023
a3f44b4
Updated doc and added comments
mkhamoyan Jun 16, 2023
cbaaf80
Applied changes suggested by @jkotas
mkhamoyan Jun 16, 2023
caebcbe
Refactored some files
mkhamoyan Jun 19, 2023
71b026e
Make the doc more readable
mkhamoyan Jun 19, 2023
9dda83a
Refactored IndexOf function
mkhamoyan Jun 19, 2023
88c6861
Add more comments in IndexOF function
mkhamoyan Jun 19, 2023
460aba0
remove localizedStandardRangeOfString
mkhamoyan Jun 20, 2023
3d5a195
Added exception in case mixed compositions
mkhamoyan Jun 21, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
122 changes: 100 additions & 22 deletions docs/design/features/globalization-hybrid-mode.md
Original file line number Diff line number Diff line change
Expand Up @@ -277,7 +277,7 @@ new CultureInfo("de-DE").CompareInfo.IndexOf("strasse", "stra\u00DFe", 0, Compar

For OSX platforms we are using native apis instead of ICU data.

**String comparison**
## String comparison

Affected public APIs:
- CompareInfo.Compare,
Expand All @@ -292,44 +292,122 @@ The number of `CompareOptions` and `NSStringCompareOptions` combinations are lim

- `None`:

`CompareOptions.None` is mapped to `NSStringCompareOptions.NSLiteralSearch`
`CompareOptions.None` is mapped to `NSStringCompareOptions.NSLiteralSearch`

There are some behaviour changes. Below are examples of such cases.
There are some behaviour changes. Below are examples of such cases.

| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** | **comments** |
|:---------------:|:---------------:|--------------------|:------------------------:|:-------:|:-------------------------------------------------------:|
| `\u3042` あ | `\u30A1` ァ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u304D\u3083` きゃ | `\u30AD\u30E3` キャ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u304D\u3083` きゃ | `\u30AD\u3083` キゃ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u3070\u3073\uFF8C\uFF9E\uFF8D\uFF9E\u307C` ばびブベぼ | `\u30D0\u30D3\u3076\u30D9\uFF8E\uFF9E` バビぶベボ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u3060` だ | `\u30C0` ダ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u00C0` À | `A\u0300` À | None | 1 | 0 | This is not same character for native api |
mdh1418 marked this conversation as resolved.
Show resolved Hide resolved
| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** | **comments** |
|:---------------:|:---------------:|--------------------|:------------------------:|:-------:|:-------------------------------------------------------:|
| `\u3042` あ | `\u30A1` ァ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u304D\u3083` きゃ | `\u30AD\u30E3` キャ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u304D\u3083` きゃ | `\u30AD\u3083` キゃ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u3070\u3073\uFF8C\uFF9E\uFF8D\uFF9E\u307C` ばびブベぼ | `\u30D0\u30D3\u3076\u30D9\uFF8E\uFF9E` バビぶベボ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u3060` だ | `\u30C0` ダ | None | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |

- `StringSort` :

`CompareOptions.StringSort` is mapped to `NSStringCompareOptions.NSLiteralSearch` .ICU's default is to use "StringSort", i.e. nonalphanumeric symbols come before alphanumeric. That is how works also `NSLiteralSearch`.
`CompareOptions.StringSort` is mapped to `NSStringCompareOptions.NSLiteralSearch` .ICU's default is to use "StringSort", i.e. nonalphanumeric symbols come before alphanumeric. That is how works also `NSLiteralSearch`.

- `IgnoreCase`:

`CompareOptions.IgnoreCase` is mapped to `NSStringCompareOptions.NSCaseInsensitiveSearch | NSStringCompareOptions.NSLiteralSearch`
`CompareOptions.IgnoreCase` is mapped to `NSStringCompareOptions.NSCaseInsensitiveSearch | NSStringCompareOptions.NSLiteralSearch`

There are some behaviour changes. Below are examples of such cases.
There are some behaviour changes. Below are examples of such cases.

| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** | **comments** |
|:---------------:|:---------------:|--------------------|:------------------------:|:-------:|:-------------------------------------------------------:|
| `\u3060` だ | `\u30C0` ダ | IgnoreCase | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |
| `\u00C0` À | `a\u0300` à | IgnoreCase | 1 | 0 | This is related to above mentioned case under `CompareOptions.None` i.e. `\u00C0` À != À `A\u0300` |
| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** | **comments** |
|:---------------:|:---------------:|--------------------|:------------------------:|:-------:|:-------------------------------------------------------:|
| `\u3060` だ | `\u30C0` ダ | IgnoreCase | 1 | -1 | hiragana and katakana characters are ordered differently compared to ICU |

- `IgnoreNonSpace`:

`CompareOptions.IgnoreNonSpace` is mapped to `NSStringCompareOptions.NSDiacriticInsensitiveSearch | NSStringCompareOptions.NSLiteralSearch`
`CompareOptions.IgnoreNonSpace` is mapped to `NSStringCompareOptions.NSDiacriticInsensitiveSearch | NSStringCompareOptions.NSLiteralSearch`

- `IgnoreWidth`:

`CompareOptions.IgnoreWidth` is mapped to `NSStringCompareOptions.NSWidthInsensitiveSearch | NSStringCompareOptions.NSLiteralSearch`
`CompareOptions.IgnoreWidth` is mapped to `NSStringCompareOptions.NSWidthInsensitiveSearch | NSStringCompareOptions.NSLiteralSearch`

- All combinations that contain below `CompareOptions` always throw `PlatformNotSupportedException`:

`IgnoreSymbols`,
`IgnoreSymbols`,

`IgnoreKanaType`,

## String starts with / ends with

Affected public APIs:
- CompareInfo.IsPrefix
- CompareInfo.IsSuffix
- String.StartsWith
- String.EndsWith

Mapped to Apple Native API `compare:options:range:locale:`(https://developer.apple.com/documentation/foundation/nsstring/1414561-compare?language=objc)
Apple Native API does not expose locale-sensitive endsWith/startsWith function. As a workaround, both strings get normalized and weightless characters are removed. Resulting strings are cut to the same length and comparison is performed. As we are normalizing strings to be able to cut them, we cannot calculate the match length on the original strings. Methods that calculate this information throw PlatformNotSupported exception:

- [CompareInfo.IsPrefix](https://learn.microsoft.com/dotnet/api/system.globalization.compareinfo.isprefix)
- [CompareInfo.IsSuffix](https://learn.microsoft.com/dotnet/api/system.globalization.compareinfo.issuffix)

- `IgnoreSymbols`

As there is no IgnoreSymbols equivalent in NSStringCompareOptions all `CompareOptions` combinations that include `IgnoreSymbols` throw `PlatformNotSupportedException`

## String indexing

Affected public APIs:
- CompareInfo.IndexOf
- CompareInfo.LastIndexOf
- String.IndexOf
- String.LastIndexOf

Mapped to Apple Native API `rangeOfString:options:range:locale:`(https://developer.apple.com/documentation/foundation/nsstring/1417348-rangeofstring?language=objc)

In `rangeOfString:options:range:locale:` objects are compared by checking the Unicode canonical equivalence of their code point sequences.
In cases where search string contains diaeresis and has different normalization form than in source string result can be incorrect.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't limited to just the diaeresis diacritic mark is it? If it isn't limited to just diaeresis can we change it to diacritics.


Characters in general are represented by unicode code points, and some characters can be represented in a single code point or by combining multiple characters (like diacritics/diaeresis). Normalization Form C will look to compress characters to their single code point format if they were originally represented as a sequence of multiple code points. Normalization Form D does the opposite and expands characters into their multiple code point formats if possible.

`NSString` `rangeOfString:options:range:locale:` uses canonical equivalence to find the position of the `searchString` within the `sourceString`, however, it does not automatically handle comparison of precomposed (single code point representation) or decomposed (most code points representation). Because the `searchString` and `sourceString` can be of differing formats, to properly find the index, we need to ensure that the searchString is in the same form as the sourceString by checking the `rangeOfString:options:range:locale:` using every single normalization form.

Here are the covered cases with diaeresis:
1. Search string contains diaeresis and has same normalization form as in source string.
2. Search string contains diaeresis but with source string they have same letters with different char lengths but substring is normalized in source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: diacritic


a. search string `normalizing to form C` is substring of source string. example: search string: `U\u0308` source string: `Source is \u00DC` => matchLength is 1

b. search string `normalizing to form D` is substring of source string. example: search string: `\u00FC` source string: `Source is \u0075\u0308` => matchLength is 2

Not covered case:

Search string contains diaeresis and with source string they have same letters with different char lengths but substring is not
normalized in source. example: search string: `U\u0308 and \u00FC` source string: `Source is a\u0308\u0308a and \u0075\u0308`
as it is visible from example normalizaing search string to form C or D will not help to find substring in source string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I'm understanding this.
searchString U\u0308 and \u00FC (Ü and ü)
source string a\u0308\u0308a and \u0075\u0308 (ä̈a and ü)
Is the search string supposed to be a substring in the source string? Was it supposed to be U\u0308 and \u00FC should both be interpreted as substrings of a\u0308\u0308a and (\u0075\u0308) in that it matches the portion in the parenthesis?

I was actually expecting the not covered cases to be something like, source string is a sequence of unicode code points that represent a mixture of precomposed characters and decomposed characters.
i.e. Source string H\u00EBll\u006F\u0308 World (Hëllö World)
Search string \u0065\u0308ll\u00F6 (ëllö)
These would bypass the precomposed vs precomposed comparison because it would be source
string H\u00EBll\u00F6 World and search string \u00EBll\u00F6
However, neither the fully precomposed form of the search string \u00EBll\u00FB nor the fully decomposed form of the search string \u0065\u0308ll\u006F\u0308 would match the unicode code point sequence in the source string \u00EBll\u006F\u0308 so it would fall through the Form C and Form D fallbacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. Source should be "Source is \u00DC and \u0075\u0308".
search string: U\u0308 and \u00FC (Ü and ü) source string: Source is \u00DC and \u0075\u0308 (Source is Ü and ü). Will update it.


- `IgnoreSymbols`

As there is no IgnoreSymbols equivalent in NSStringCompareOptions all `CompareOptions` combinations that include `IgnoreSymbols` throw `PlatformNotSupportedException`

- Some letters consist of more than one grapheme.

Apple Native Api does not guarantee that string will be segmented by letters but by graphemes. E.g. in `cs-CZ` and `sk-SK` "ch" is 1 letter, 2 graphemes. The following code with `HybridGlobalization` switched off returns -1 (not found) while with `HybridGlobalization` switched on, it returns 1.

``` C#
new CultureInfo("sk-SK").CompareInfo.IndexOf("ch", "h"); // -1 or 1
```

- Some graphemes have multi-grapheme equivalents.
E.g. in `de-DE` ß (%u00DF) is one letter and one grapheme and "ss" is one letter and is recognized as two graphemes. Apple Native API's equivalent of `IgnoreNonSpace` treats them as the same letter when comparing. Similar case: dz (%u01F3) and dz.

Using `IgnoreNonSpace` for these two with `HybridGlobalization` off, also returns 0 (they are equal). However, the workaround used in `HybridGlobalization` will compare them grapheme-by-grapheme and will return -1.

``` C#
new CultureInfo("de-DE").CompareInfo.IndexOf("strasse", "stra\u00DFe", 0, CompareOptions.IgnoreNonSpace); // 0 or -1
```


## SortKey

Affected public APIs:
- CompareInfo.GetSortKey
- CompareInfo.GetSortKeyLength
- CompareInfo.GetHashCode

`IgnoreKanaType`,
Apple Native API does not have an equivalent, so they throw `PlatformNotSupportedException`.
11 changes: 11 additions & 0 deletions src/libraries/Common/src/Interop/Interop.Collation.OSX.cs
Original file line number Diff line number Diff line change
Expand Up @@ -12,5 +12,16 @@ internal static partial class Globalization
{
[LibraryImport(Libraries.GlobalizationNative, EntryPoint = "GlobalizationNative_CompareStringNative", StringMarshalling = StringMarshalling.Utf16)]
internal static unsafe partial int CompareStringNative(string localeName, int lNameLen, char* lpStr1, int cwStr1Len, char* lpStr2, int cwStr2Len, CompareOptions options);

[LibraryImport(Libraries.GlobalizationNative, EntryPoint = "GlobalizationNative_EndsWithNative", StringMarshalling = StringMarshalling.Utf16)]
[MethodImpl(MethodImplOptions.NoInlining)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why are we adding NoInlining here and on StartsWithNative but not on the other ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did like is done for ICU https://github.com/dotnet/runtime/blob/main/src/libraries/Common/src/Interop/Interop.Collation.cs#L34

But not sure why only these 2 functions have MethodImplOptions.NoInlining

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that it is to avoid regressing the hot path of the method by inlining the PInvoke.

internal static unsafe partial int EndsWithNative(string localeName, int lNameLen, char* target, int cwTargetLength, char* source, int cwSourceLength, CompareOptions options);

[LibraryImport(Libraries.GlobalizationNative, EntryPoint = "GlobalizationNative_IndexOfNative", StringMarshalling = StringMarshalling.Utf16)]
internal static unsafe partial NSRange IndexOfNative(string localeName, int lNameLen, char* target, int cwTargetLength, char* pSource, int cwSourceLength, CompareOptions options, [MarshalAs(UnmanagedType.Bool)] bool fromBeginning);

[LibraryImport(Libraries.GlobalizationNative, EntryPoint = "GlobalizationNative_StartsWithNative", StringMarshalling = StringMarshalling.Utf16)]
[MethodImpl(MethodImplOptions.NoInlining)]
internal static unsafe partial int StartsWithNative(string localeName, int lNameLen, char* target, int cwTargetLength, char* source, int cwSourceLength, CompareOptions options);
}
}
11 changes: 11 additions & 0 deletions src/libraries/Common/src/Interop/OSX/NSRange.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System.Runtime.InteropServices;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using System.Runtime.InteropServices;

Unnecessary

namespace System.Globalization
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace System.Globalization
internal static partial class Interop

This type is not specific to globalization. It should have neutral namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the unmanaged side is not returning the NSRange type in the latest iteration. It returns its own Range type. In that case, it should not be called NSRange. It should be called Range to match the unmanaged side and it should be in System.Globalization.iOS.cs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using NSRange https://developer.apple.com/documentation/foundation/nsrange?language=objc on pal_collation.m,
but to make it compiled on https://github.com/dotnet/runtime/blob/main/src/native/libs/System.Globalization.Native/pal_collation.h I needed to add Range type.
Do you mean Range struct should be added in System.Globalization.iOS.cs ? I couldn't find that file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant src/libraries/Common/src/Interop/Interop.Collation.OSX.cs

{
internal struct NSRange
{
public int Location;
public int Length;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -192,10 +192,10 @@ public static IEnumerable<object[]> Compare_TestData()
yield return new object[] { s_invariantCompare, "i", "\u0130", CompareOptions.None, -1 };
yield return new object[] { s_invariantCompare, "i", "\u0130", CompareOptions.IgnoreCase, -1 };

yield return new object[] { s_invariantCompare, "\u00C0", "A\u0300", CompareOptions.None, PlatformDetection.IsHybridGlobalizationOnOSX ? 1 : 0 };
yield return new object[] { s_invariantCompare, "\u00C0", "A\u0300", CompareOptions.None, 0 };
yield return new object[] { s_invariantCompare, "\u00C0", "A\u0300", CompareOptions.Ordinal, 1 };
yield return new object[] { s_invariantCompare, "\u00C0", "a\u0300", CompareOptions.None, 1 };
yield return new object[] { s_invariantCompare, "\u00C0", "a\u0300", CompareOptions.IgnoreCase, PlatformDetection.IsHybridGlobalizationOnOSX ? 1 : 0 };
yield return new object[] { s_invariantCompare, "\u00C0", "a\u0300", CompareOptions.IgnoreCase, 0 };
yield return new object[] { s_invariantCompare, "\u00C0", "a\u0300", CompareOptions.Ordinal, 1 };
yield return new object[] { s_invariantCompare, "\u00C0", "a\u0300", CompareOptions.OrdinalIgnoreCase, 1 };
yield return new object[] { s_invariantCompare, "FooBar", "Foo\u0400Bar", CompareOptions.Ordinal, -1 };
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ public static IEnumerable<object[]> IndexOf_TestData()
yield return new object[] { s_invariantCompare, "foobardzsdzs", "rddzs", 0, 12, CompareOptions.Ordinal, -1, 0 };

// Slovak
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
if (!PlatformDetection.IsHybridGlobalizationOnBrowser && !PlatformDetection.IsHybridGlobalizationOnOSX)
{
yield return new object[] { s_slovakCompare, "ch", "h", 0, 2, CompareOptions.None, -1, 0 };
// Android has its own ICU, which doesn't work well with slovak
Expand Down Expand Up @@ -82,7 +82,7 @@ public static IEnumerable<object[]> IndexOf_TestData()
yield return new object[] { s_invariantCompare, "hello", "\u200d", 1, 3, CompareOptions.IgnoreCase, 1, 0 };

// Ignore symbols
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
if (!PlatformDetection.IsHybridGlobalizationOnBrowser && !PlatformDetection.IsHybridGlobalizationOnOSX)
yield return new object[] { s_invariantCompare, "More Test's", "Tests", 0, 11, CompareOptions.IgnoreSymbols, 5, 6 };
yield return new object[] { s_invariantCompare, "More Test's", "Tests", 0, 11, CompareOptions.None, -1, 0 };
yield return new object[] { s_invariantCompare, "cbabababdbaba", "ab", 0, 13, CompareOptions.None, 2, 2 };
Expand Down Expand Up @@ -142,8 +142,11 @@ public static IEnumerable<object[]> IndexOf_TestData()
{
yield return new object[] { s_germanCompare, "abc Strasse Strasse xyz", "stra\u00DFe", 0, 23, supportedIgnoreCaseIgnoreNonSpaceOptions, 4, 7 };
yield return new object[] { s_germanCompare, "abc stra\u00DFe stra\u00DFe xyz", "Strasse", 0, 21, supportedIgnoreCaseIgnoreNonSpaceOptions, 4, 6 };
yield return new object[] { s_invariantCompare, "abcdzxyz", "\u01F3", 0, 8, supportedIgnoreNonSpaceOption, 3, 2 };
yield return new object[] { s_invariantCompare, "abc\u01F3xyz", "dz", 0, 7, supportedIgnoreNonSpaceOption, 3, 1 };
if (!PlatformDetection.IsHybridGlobalizationOnOSX)
{
yield return new object[] { s_invariantCompare, "abcdzxyz", "\u01F3", 0, 8, supportedIgnoreNonSpaceOption, 3, 2 };
yield return new object[] { s_invariantCompare, "abc\u01F3xyz", "dz", 0, 7, supportedIgnoreNonSpaceOption, 3, 1 };
ilonatommy marked this conversation as resolved.
Show resolved Hide resolved
}
}
yield return new object[] { s_germanCompare, "abc Strasse Strasse xyz", "xtra\u00DFe", 0, 23, supportedIgnoreCaseIgnoreNonSpaceOptions, -1, 0 };
yield return new object[] { s_germanCompare, "abc stra\u00DFe stra\u00DFe xyz", "Xtrasse", 0, 21, supportedIgnoreCaseIgnoreNonSpaceOptions, -1, 0 };
Expand Down
Loading