- 
                Notifications
    You must be signed in to change notification settings 
- Fork 5.2k
Fix weightless code point handling in IndexOf and friends, plus other CompareInfo improvements #1514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix weightless code point handling in IndexOf and friends, plus other CompareInfo improvements #1514
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to move this check outside the loop for the perf sake. it is enough to check this condition first time only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't quite figure out how to do this. The alternative would be to do an "is this equivalent to empty?" check at the beginning of the method, but that's a full p/invoke and call into NLS and I think it would vastly outweigh a simple if check in the middle of a loop.
Do you have a recommendation for how to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may do the following:
index = ci.IndexOf(this, oldValue, startIndex, this.Length - startIndex, options, &matchLength);
if (matchLength == 0)
{
throw....
}
        do
        {
            if (index >= 0)
            {
                ...
             }
            ....
            index = ci.IndexOf(this, oldValue, startIndex, this.Length - startIndex, options, &matchLength);
        }
In reply to: 364889287 [](ancestors = 364889287)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that the call to IndexOf is itself a p/invoke, which has significant overhead. Certainly more overhead than a single if check (which the branch predictor will optimize not-taken) in the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine if you think it is not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is correct but this something you want to double check. here is the scenario
this = ""
value = "A"
startIndex = -1
I believe this should return -1 but now we'll return 0 which is wrong I guess.
        
          
                src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe in such cases we should just compare value string against "" and return right away. no need to go further. This should be done after validating the flags in the next if block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it is worth moving this block after the check (options & ValidIndexMaskOffFlags) and in this block we can return either way with handling the case if value length is not 0. something like:
        if (startIndex < 0)
        {
            if (source.Length == 0)
            {
                 return value.Length == 0 || Compare("", value,...) is true ? 0 : -1;
            }
            else
            {
                throw new ArgumentOutOfRangeException(nameof(startIndex), SR.ArgumentOutOfRange_Index);
            }
Also we may think handling he case value.Length == 0 and this.Length > 0 too if we need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startIndex [](start = 56, length = 10)
isn't IndexOfCore already adding the start index to the return value?
        
          
                src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as the one I added to similar code.
456552c    to
    8d8e253      
    Compare
  
    | Most recent iteration addresses earlier PR feedback and also publicly exposes span-based APIs since I was touching all of these code paths anyway. I'm going to perform one more rebase with a more limited set of commits - hopefully only 4 or 5 commits total. That should make it significantly easier to review, as each commit would correspond to a logical chunk of work and could be reviewed in isolation. | 
| @GrabYourPitchforks, are you still working on this? | 
| Yes. I've started breaking it out into smaller PRs, such as #32385. Will push on those to try to speed up review. | 
| 
 Thanks. And this one will still be relevant? | 
        
          
                src/libraries/Native/Unix/System.Globalization.Native/pal_collation.c
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | I'm having a really, really difficult time trying to figure out why some tests are failing on ICU, but only on certain platforms. See for instance https://helix.dot.net/api/2019-06-17/jobs/f2e28863-7590-4c49-9cad-5648f983d563/workitems/System.Memory.Tests/console:     System.Tests.StringTests.Compare(strA: null, indexA: 0, strB: "Hello", indexB: 0, length: 5, comparisonType: InvariantCultureIgnoreCase, expected: -1) [FAIL]
      Assert.Equal() Failure
      Expected: -1
      Actual:   0
      Stack Trace:
        /_/src/libraries/Common/tests/Tests/System/StringTests.cs(707,0): at System.Tests.StringTests.Compare(String strA, Int32 indexA, String strB, Int32 indexB, Int32 length, StringComparison comparisonType, Int32 expected)I see nothing in the code that looks suspicious, nor can I repro this on my local box. Suspect it has something to do with whatever version of ICU is on these platforms not appreciating null as an input parameter. | 
| Yup, it ended up being an ICU bug. Filed at https://unicode-org.atlassian.net/projects/ICU/issues/ICU-9396, fixed in unicode-org/icu@271e788. The fix was committed almost 7 years ago but I suppose there are still some ancient versions floating around on our test platforms. | 
| 
 We support ICU >= 50.0, CentOS 7 has version 50.2 | 
| There's also a Win7-specific AV I'm investigating. Hope to have some resolution on that today. | 
| Oh man, CI's finally green on this. I'm going to celebrate by downing an entire cake, all by myself. 🎂 | 
| @GrabYourPitchforks push the button before it changes its color :-) | 
| Can't push the button just yet until tomorrow's API review. :) | 
Throw useful exception when destination buffer too small
| @GrabYourPitchforks this is labeled breaking change but I can't find a breaking change issue I don't think. If necessary can you please create one with https://github.com/dotnet/docs/issues/new?template=dotnet-breaking-change.md | 
| Breaking change doc at dotnet/docs#21372. | 
Fixes #25428 (new span-based APIs on
CompareInfo).Fixes #26621 (remove code duplicate in
CompareInfo).Contributes to #27912 (flow
Runethrough more APIs).Fixes #34828 (
CompareInfo.IndexOfzero-weight code point issue).The
stringandCompareInfoclasses have some incorrect shortcut logic when dealing with APIs likeIndexOf. This logic doesn't account for the fact that certain non-empty strings may be linguistically equivalent to empty (i.e., consist only of weightless Unicode code points under some locale). See #1062 for a related issue affectingstring.Equalsandstring.GetHashCode. This may cause the methods to return "no match" when they should have returned "a match exists". In the case ofstring.Replace, it could cause an endless loop.The weightless code point issue only occurs when a culture-aware (i.e., not
OrdinalorOrdinalIgnoreCase, and not underGlobalizationMode.Invariant) comparer is used.Additionally, the
LastIndexOfAPIs have historically returned incorrect values when given empty string (or empty-equivalent) inputs. For example:See #13383 for more information on why the results above should be 5 and 0, respectively.
Below is the list of all existing public APIs affected by this PR.
string.IndexOfandstring.LastIndexOfstring.StartsWithandstring.EndsWithstring.Containsstring.ReplaceCompareInfo.IndexOfandCompareInfo.LastIndexOfCompareInfo.IsPrefixandCompareInfo.IsSuffixMemoryExtensions.IndexOfandMemoryExtensions.LastIndexOfMemoryExtensions.StartsWithandMemoryExtensions.EndsWithMemoryExtensions.ContainsStringBuilderis not affected since it neither exposes aLastIndexOfmethod nor has culture-aware logic.This PR also removes most duplicate code from the
CompareInfocode paths, favoring span-based code wherever possible. Recent improvements to the JIT stack zeroing logic also allows us to remove our workaround for #8890.Finally, this PR introduces some
Rune-based APIs in theCompareInfoclass. Not all of these APIs are yet approved so I'll need to get a quick email signoff. The pattern I followed is that if there was aCompareInfoAPI that acceptedcharas input, there should be an equivalent which acceptsRuneas input in the same location. There's also an allocation-free variant ofGetSortKeyfor folks who need it.