Skip to content

Conversation

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Jan 9, 2020

Fixes #25428 (new span-based APIs on CompareInfo).
Fixes #26621 (remove code duplicate in CompareInfo).
Contributes to #27912 (flow Rune through more APIs).
Fixes #34828 (CompareInfo.IndexOf zero-weight code point issue).

The string and CompareInfo classes have some incorrect shortcut logic when dealing with APIs like IndexOf. This logic doesn't account for the fact that certain non-empty strings may be linguistically equivalent to empty (i.e., consist only of weightless Unicode code points under some locale). See #1062 for a related issue affecting string.Equals and string.GetHashCode. This may cause the methods to return "no match" when they should have returned "a match exists". In the case of string.Replace, it could cause an endless loop.

The weightless code point issue only occurs when a culture-aware (i.e., not Ordinal or OrdinalIgnoreCase, and not under GlobalizationMode.Invariant) comparer is used.

Additionally, the LastIndexOf APIs have historically returned incorrect values when given empty string (or empty-equivalent) inputs. For example:

// U+200D below is the weightless zero-width joiner (ZWJ) code point.

Console.WriteLine("abcde".LastIndexOf(string.Empty, StringComparison.InvariantCulture)); // prints 4 (should print 5)
Console.WriteLine("abcde".LastIndexOf("\u200d", StringComparison.InvariantCulture));     // prints 5 (should print 5)

Console.WriteLine(string.Empty.LastIndexOf(string.Empty, StringComparison.InvariantCulture)); // prints 0  (should print 0)
Console.WriteLine(string.Empty.LastIndexOf("\u200d", StringComparison.InvariantCulture));     // prints -1 (should print 0)

See #13383 for more information on why the results above should be 5 and 0, respectively.

Below is the list of all existing public APIs affected by this PR.

  • string.IndexOf and string.LastIndexOf
  • string.StartsWith and string.EndsWith
  • string.Contains
  • string.Replace
  • CompareInfo.IndexOf and CompareInfo.LastIndexOf
  • CompareInfo.IsPrefix and CompareInfo.IsSuffix
  • MemoryExtensions.IndexOf and MemoryExtensions.LastIndexOf
  • MemoryExtensions.StartsWith and MemoryExtensions.EndsWith
  • MemoryExtensions.Contains

StringBuilder is not affected since it neither exposes a LastIndexOf method nor has culture-aware logic.

This PR also removes most duplicate code from the CompareInfo code paths, favoring span-based code wherever possible. Recent improvements to the JIT stack zeroing logic also allows us to remove our workaround for #8890.

Finally, this PR introduces some Rune-based APIs in the CompareInfo class. Not all of these APIs are yet approved so I'll need to get a quick email signoff. The pattern I followed is that if there was a CompareInfo API that accepted char as input, there should be an equivalent which accepts Rune as input in the same location. There's also an allocation-free variant of GetSortKey for folks who need it.

@GrabYourPitchforks GrabYourPitchforks added the tenet-compatibility Incompatibility with previous versions or .NET Framework label Jan 9, 2020
@GrabYourPitchforks GrabYourPitchforks changed the title Update string IndexOf / LastIndexOf / StartsWith / EndsWith Fix weightless code point identification in IndexOf / LastIndexOf / StartsWith / EndsWith Jan 9, 2020
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to move this check outside the loop for the perf sake. it is enough to check this condition first time only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't quite figure out how to do this. The alternative would be to do an "is this equivalent to empty?" check at the beginning of the method, but that's a full p/invoke and call into NLS and I think it would vastly outweigh a simple if check in the middle of a loop.

Do you have a recommendation for how to do this?

Copy link
Member

@tarekgh tarekgh Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may do the following:
index = ci.IndexOf(this, oldValue, startIndex, this.Length - startIndex, options, &matchLength);
if (matchLength == 0)
{
throw....
}

        do
        {
            if (index >= 0)
            {
                ...
             }

            ....

            index = ci.IndexOf(this, oldValue, startIndex, this.Length - startIndex, options, &matchLength);

        }

In reply to: 364889287 [](ancestors = 364889287)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that the call to IndexOf is itself a p/invoke, which has significant overhead. Certainly more overhead than a single if check (which the branch predictor will optimize not-taken) in the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine if you think it is not worth it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is correct but this something you want to double check. here is the scenario

this = ""
value = "A"
startIndex = -1

I believe this should return -1 but now we'll return 0 which is wrong I guess.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in such cases we should just compare value string against "" and return right away. no need to go further. This should be done after validating the flags in the next if block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is worth moving this block after the check (options & ValidIndexMaskOffFlags) and in this block we can return either way with handling the case if value length is not 0. something like:

        if (startIndex < 0)
        {
            if (source.Length == 0)
            {
                 return value.Length == 0 || Compare("", value,...) is true ? 0 : -1;
            }
            else
            {
                throw new ArgumentOutOfRangeException(nameof(startIndex), SR.ArgumentOutOfRange_Index);
            }

Also we may think handling he case value.Length == 0 and this.Length > 0 too if we need to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startIndex [](start = 56, length = 10)

isn't IndexOfCore already adding the start index to the return value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as the one I added to similar code.

@GrabYourPitchforks
Copy link
Member Author

Most recent iteration addresses earlier PR feedback and also publicly exposes span-based APIs since I was touching all of these code paths anyway. I'm going to perform one more rebase with a more limited set of commits - hopefully only 4 or 5 commits total. That should make it significantly easier to review, as each commit would correspond to a logical chunk of work and could be reviewed in isolation.

@GrabYourPitchforks GrabYourPitchforks added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 31, 2020
@stephentoub
Copy link
Member

@GrabYourPitchforks, are you still working on this?

@GrabYourPitchforks
Copy link
Member Author

Yes. I've started breaking it out into smaller PRs, such as #32385. Will push on those to try to speed up review.

@stephentoub
Copy link
Member

I've started breaking it out into smaller PRs

Thanks. And this one will still be relevant?

@GrabYourPitchforks GrabYourPitchforks added breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. and removed tenet-compatibility Incompatibility with previous versions or .NET Framework labels Apr 10, 2020
@GrabYourPitchforks GrabYourPitchforks added this to the 5.0 milestone Apr 10, 2020
@GrabYourPitchforks
Copy link
Member Author

I'm having a really, really difficult time trying to figure out why some tests are failing on ICU, but only on certain platforms. See for instance https://helix.dot.net/api/2019-06-17/jobs/f2e28863-7590-4c49-9cad-5648f983d563/workitems/System.Memory.Tests/console:

    System.Tests.StringTests.Compare(strA: null, indexA: 0, strB: "Hello", indexB: 0, length: 5, comparisonType: InvariantCultureIgnoreCase, expected: -1) [FAIL]
      Assert.Equal() Failure
      Expected: -1
      Actual:   0
      Stack Trace:
        /_/src/libraries/Common/tests/Tests/System/StringTests.cs(707,0): at System.Tests.StringTests.Compare(String strA, Int32 indexA, String strB, Int32 indexB, Int32 length, StringComparison comparisonType, Int32 expected)

I see nothing in the code that looks suspicious, nor can I repro this on my local box. Suspect it has something to do with whatever version of ICU is on these platforms not appreciating null as an input parameter.

@GrabYourPitchforks
Copy link
Member Author

Yup, it ended up being an ICU bug. Filed at https://unicode-org.atlassian.net/projects/ICU/issues/ICU-9396, fixed in unicode-org/icu@271e788. The fix was committed almost 7 years ago but I suppose there are still some ancient versions floating around on our test platforms.

@janvorli
Copy link
Member

I suppose there are still some ancient versions floating around on our test platforms.

We support ICU >= 50.0, CentOS 7 has version 50.2

@GrabYourPitchforks
Copy link
Member Author

There's also a Win7-specific AV I'm investigating. Hope to have some resolution on that today.

@GrabYourPitchforks
Copy link
Member Author

Oh man, CI's finally green on this. I'm going to celebrate by downing an entire cake, all by myself. 🎂

@tarekgh
Copy link
Member

tarekgh commented Apr 23, 2020

@GrabYourPitchforks push the button before it changes its color :-)

@GrabYourPitchforks
Copy link
Member Author

Can't push the button just yet until tomorrow's API review. :)

Throw useful exception when destination buffer too small
@GrabYourPitchforks GrabYourPitchforks removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Apr 24, 2020
@GrabYourPitchforks GrabYourPitchforks merged commit 90cc6df into dotnet:master Apr 24, 2020
@GrabYourPitchforks GrabYourPitchforks deleted the string_replace branch April 24, 2020 23:47
@danmoseley
Copy link
Member

@GrabYourPitchforks this is labeled breaking change but I can't find a breaking change issue I don't think. If necessary can you please create one with https://github.com/dotnet/docs/issues/new?template=dotnet-breaking-change.md

@danmoseley danmoseley added the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Jul 30, 2020
@GrabYourPitchforks
Copy link
Member Author

Breaking change doc at dotnet/docs#21372.

@GrabYourPitchforks GrabYourPitchforks removed the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Nov 6, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease.

Projects

None yet

8 participants