Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement collation native functions functions #86895

Merged
merged 27 commits into from
Jun 21, 2023
Merged

Conversation

mkhamoyan
Copy link
Contributor

@mkhamoyan mkhamoyan commented May 30, 2023

Implements a chunk of apple native-api based globalization.

Old, icu-based private API:

  • GlobalizationNative_StartsWith
  • GlobalizationNative_EndsWith
  • GlobalizationNative_IndexOf
  • GlobalizationNative_LastIndexOf

New, non-icu private API:

  • GlobalizationNative_StartsWithNative
  • GlobalizationNative_EndsWithNative
  • GlobalizationNative_IndexOfNative // will be called also for LastIndexOf

Affected public API :

  • CompareInfo.IsPrefix
  • CompareInfo.IsSuffix
  • String.StartsWith
  • String.EndsWith
  • CompareInfo.IndexOf
  • CompareInfo.LastIndexOf
  • String.IndexOf
  • String.LastIndexOf

APIs that will not be supported

  • CompareInfo.GetSortKey
  • CompareInfo.GetSortKeyLength
  • CompareInfo.GetHashCode

Fixes #86699
Contributes to #80689
cc @SamMonoRT

@ghost
Copy link

ghost commented May 30, 2023

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: mkhamoyan
Assignees: mkhamoyan
Labels:

area-System.Globalization

Milestone: -

@mkhamoyan mkhamoyan added the NO-REVIEW Experimental/testing PR, do NOT review it label May 30, 2023
@tarekgh tarekgh added os-mac-os-x macOS aka OSX os-ios Apple iOS os-tvos Apple tvOS labels May 30, 2023
@ghost
Copy link

ghost commented May 30, 2023

Tagging subscribers to 'os-ios': @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: mkhamoyan
Assignees: mkhamoyan
Labels:

NO-REVIEW, area-System.Globalization, os-mac-os-x, os-ios, os-tvos

Milestone: -

@mkhamoyan mkhamoyan changed the title Implemented IndexOf and LastIndexOf functions Implemented IndexOf, LastIndexOf, StartsWith, EndsWith functions Jun 1, 2023
@mkhamoyan mkhamoyan changed the title Implemented IndexOf, LastIndexOf, StartsWith, EndsWith functions Implement IndexOf, LastIndexOf, StartsWith, EndsWith functions Jun 1, 2023
@mkhamoyan mkhamoyan changed the title Implement IndexOf, LastIndexOf, StartsWith, EndsWith functions Implement collation native functions functions Jun 1, 2023
@mkhamoyan mkhamoyan removed the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 2, 2023
@@ -12,5 +12,18 @@ internal static partial class Globalization
{
[LibraryImport(Libraries.GlobalizationNative, EntryPoint = "GlobalizationNative_CompareStringNative", StringMarshalling = StringMarshalling.Utf16)]
internal static unsafe partial int CompareStringNative(string localeName, int lNameLen, char* lpStr1, int cwStr1Len, char* lpStr2, int cwStr2Len, CompareOptions options);

[LibraryImport(Libraries.GlobalizationNative, EntryPoint = "GlobalizationNative_EndsWithNative", StringMarshalling = StringMarshalling.Utf16)]
[MethodImpl(MethodImplOptions.NoInlining)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why are we adding NoInlining here and on StartsWithNative but not on the other ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did like is done for ICU https://github.com/dotnet/runtime/blob/main/src/libraries/Common/src/Interop/Interop.Collation.cs#L34

But not sure why only these 2 functions have MethodImplOptions.NoInlining

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that it is to avoid regressing the hot path of the method by inlining the PInvoke.

@mkhamoyan
Copy link
Contributor Author

/azp run runtime-ioslike

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Mapped to Apple Native API `rangeOfString:options:range:locale:`(https://developer.apple.com/documentation/foundation/nsstring/1417348-rangeofstring?language=objc)

In `rangeOfString:options:range:locale:` objects are compared by checking the Unicode canonical equivalence of their code point sequences.
In cases where search string contains diaeresis and has different normalization form than in source string result can be incorrect.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't limited to just the diaeresis diacritic mark is it? If it isn't limited to just diaeresis can we change it to diacritics.

Comment on lines 376 to 378
Search string contains diaeresis and with source string they have same letters with different char lengths but substring is not
normalized in source. example: search string: `U\u0308 and \u00FC` source string: `Source is a\u0308\u0308a and \u0075\u0308`
as it is visible from example normalizaing search string to form C or D will not help to find substring in source string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I'm understanding this.
searchString U\u0308 and \u00FC (Ü and ü)
source string a\u0308\u0308a and \u0075\u0308 (ä̈a and ü)
Is the search string supposed to be a substring in the source string? Was it supposed to be U\u0308 and \u00FC should both be interpreted as substrings of a\u0308\u0308a and (\u0075\u0308) in that it matches the portion in the parenthesis?

I was actually expecting the not covered cases to be something like, source string is a sequence of unicode code points that represent a mixture of precomposed characters and decomposed characters.
i.e. Source string H\u00EBll\u006F\u0308 World (Hëllö World)
Search string \u0065\u0308ll\u00F6 (ëllö)
These would bypass the precomposed vs precomposed comparison because it would be source
string H\u00EBll\u00F6 World and search string \u00EBll\u00F6
However, neither the fully precomposed form of the search string \u00EBll\u00FB nor the fully decomposed form of the search string \u0065\u0308ll\u006F\u0308 would match the unicode code point sequence in the source string \u00EBll\u006F\u0308 so it would fall through the Form C and Form D fallbacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. Source should be "Source is \u00DC and \u0075\u0308".
search string: U\u0308 and \u00FC (Ü and ü) source string: Source is \u00DC and \u0075\u0308 (Source is Ü and ü). Will update it.

Comment on lines 144 to 153
// localizedStandardRangeOfString is performing a case and diacritic insensitive, locale-aware search and finding first occurance.
if ((comparisonOptions & IgnoreCase) && lNameLength == 0 && fromBeginning)
{
NSRange localizedStandardRange = [sourceStrCleaned localizedStandardRangeOfString:searchStrCleaned];
if (localizedStandardRange.location != NSNotFound)
{
result.location = localizedStandardRange.location;
result.length = localizedStandardRange.length;
return result;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrating comment

localizedStandardRangeOfString is performing case and diacritic insensitive, locale-aware search by implementation, that is why I am checking for IgnoreCase and if locale is not passed .(see https://developer.apple.com/documentation/foundation/nsstring/1413574-localizedstandardrangeofstring)


Right, I see that it finds the first match with a case insensitive and diacritic insensitive search, but are the other 7 permutations of these options (!(compareOptions & IgnoreCase) || lNameLength != 0 || !fromBeginning) covered by the following logic?

I'm wondering if this block is truly necessary. If ((comparisonOptions & IgnoreCase) && lNameLength == 0 && !fromBeginning) is covered by the following logic, I would imagine that this case is covered as well without this block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does lNameLength == 0 mean locale-aware? If its supposed to be the system locale [NSLocale systemLocale], then shouldn't this check for localeName == NULL || lNameLength == 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the IgnoreNonSpace be the diacritic insensitive condition as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

localizedStandardRangeOfString is not necessary , if it makes the function more complicated I will remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the block suggested to me that localizedStandardRangeOfString could cover one specific case that the rest of the following logic couldn't, which felt odd specifically because that this wouldn't be hit if fromBeginning was false. Another consideration is that it seems to ignore the rest of the flags of options besides case insensitive and diacritic insensitive, so I wonder if this block would have been correct in those scenarios (if we had the test cases for them). I think it could make sense to keep it if it was much faster than rangeOfString:options:range:locale:, but it seems cleaner to rely on the following logic if it can be handled correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it handles correctly all cases without localizedStandardRangeOfString.

@mkhamoyan
Copy link
Contributor Author

/azp run runtime-ioslike

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mkhamoyan mkhamoyan requested a review from mdh1418 June 20, 2023 10:21
Copy link
Member

@mdh1418 mdh1418 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for your patience!

Comment on lines 371 to 372
1. Search string contains diaeresis and has same normalization form as in source string.
2. Search string contains diaeresis but with source string they have same letters with different char lengths but substring is normalized in source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: diacritic

Comment on lines 380 to 381
Search string contains diacritics and with source string they have same letters with different char lengths but substring is not
normalized in source. example: search string: `U\u0308 and \u00FC` (Ü and ü) source string: `Source is \u00DC and \u0075\u0308` (Source is Ü and ü)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Source string's intended substring match containing characters of mixed composition forms cannot be matched by 2. because partial precomposition/decomposition is not performed.

@@ -107,13 +107,16 @@ public static IEnumerable<object[]> IsSuffix_TestData()
{
yield return new object[] { s_hungarianCompare, "foobardzsdzs", "rddzs", CompareOptions.None, false, 0 };
yield return new object[] { s_frenchCompare, "\u0153", "oe", CompareOptions.None, false, 0 };
yield return new object[] { s_invariantCompare, "\uD800\uDC00", "\uDC00", CompareOptions.None, false, 0 };
yield return new object[] { s_invariantCompare, "\uD800\uDC00", "\uDC00", CompareOptions.IgnoreCase, false, 0 };
if (!PlatformDetection.IsHybridGlobalizationOnOSX) // TODO: check this for OSX
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still a todo?

return result;

// in case search string is inside source string but we can't find the index return -2
result.location = -2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may conflict with the Debug.Assert(result.Location != -2); in CompareInfo.OSX.cs because we might actually hit -2 if we hit the unsupported mixed composition source string case. Perhaps we can use something different to indicate that this is a case that is currently unsupported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case changed return value to -3 and added exception.

@@ -0,0 +1,11 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System.Runtime.InteropServices;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using System.Runtime.InteropServices;

Unnecessary

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System.Runtime.InteropServices;
namespace System.Globalization
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace System.Globalization
internal static partial class Interop

This type is not specific to globalization. It should have neutral namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the unmanaged side is not returning the NSRange type in the latest iteration. It returns its own Range type. In that case, it should not be called NSRange. It should be called Range to match the unmanaged side and it should be in System.Globalization.iOS.cs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using NSRange https://developer.apple.com/documentation/foundation/nsrange?language=objc on pal_collation.m,
but to make it compiled on https://github.com/dotnet/runtime/blob/main/src/native/libs/System.Globalization.Native/pal_collation.h I needed to add Range type.
Do you mean Range struct should be added in System.Globalization.iOS.cs ? I couldn't find that file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant src/libraries/Common/src/Interop/Interop.Collation.OSX.cs

@mkhamoyan
Copy link
Contributor Author

/azp run runtime-ioslike

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mkhamoyan
Copy link
Contributor Author

Failures are not related.

@mkhamoyan mkhamoyan merged commit eccc410 into main Jun 21, 2023
@MichalStrehovsky MichalStrehovsky deleted the hybrid_collation_functions branch June 22, 2023 09:21
@ghost ghost locked as resolved and limited conversation to collaborators Jul 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[OSX] HybridGlobalization implement collation native functions
5 participants