Optimize CheckIriUnicodeRange #31860

Merged
merged 6 commits into from
Feb 18, 2020

MihaZupan
Member

Avoid using an intermediate string to do the range comparisons. Behavior remains the same for all inputs.

For that method alone, the speedup is roughly 10-15x.

Benchmark for "scheme:" + { '\ud83f', '\udffe' } * 1000:

| Method | Toolchain | Mean | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| NewUr1 | clean\CoreRun.exe | 345.1 us | 1.31 | 69.3359 | 13.6719 | - | 285.44 KB |
| NewUr1 | new\CoreRun.exe | 270.0 us | 1.00 | 62.0117 | 12.2070 | - | 254.19 KB |
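
For readers who want to reproduce the benchmark input, here is a minimal sketch of how such a string could be built. The class and method names (`BenchmarkInputSketch`, `BuildInput`) are hypothetical and not taken from the PR. Note that the pair `'\ud83f', '\udffe'` decodes to U+1FFFE, which is one of the excluded code points, so this input presumably exercises the rejection path of the range check.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not code from the PR.
static class BenchmarkInputSketch
{
    // Builds "scheme:" followed by `repeat` copies of the surrogate pair
    // U+D83F U+DFFE (which together encode the code point U+1FFFE).
    public static string BuildInput(int repeat = 1000)
        => "scheme:" + string.Concat(Enumerable.Repeat("\ud83f\udffe", repeat));
}
```

Each repetition contributes two UTF-16 code units, so the default input is 7 + 2000 chars long.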

@MihaZupan MihaZupan requested a review from a team February 6, 2020 14:53
@davidsh davidsh added this to the 5.0 milestone Feb 6, 2020
@EgorBo
Member

EgorBo commented Feb 6, 2020

Interesting, LLVM managed to vectorize similar code (in C++): https://godbolt.org/z/oetv_u 🙂 (just saying)

@GrabYourPitchforks
Member

The majority of the clauses in the if statement aren't necessary. For instance, here is untested optimized code:

// This method implements the ABNF checks per https://tools.ietf.org/html/rfc3987#section-2.2
internal static bool CheckIriUnicodeRange(char highSurr, char lowSurr, ref bool surrogatePair, bool isQuery)
{
    bool inRange = false;
    surrogatePair = false;

    Debug.Assert(char.IsHighSurrogate(highSurr));

    if (Rune.TryCreate(highSurr, lowSurr, out Rune rune))
    {
        surrogatePair = true;

        // U+xxFFFE..U+xxFFFF are noncharacters in every plane, so we exclude them.
        // U+E0000..U+E0FFF is disallowed per the 'ucschar' definition in the ABNF.
        // U+F0000 and above are only allowed for 'iprivate' per the ABNF (isQuery = true).

        inRange = ((ushort)rune.Value < 0xFFFE)
            && ((uint)(rune.Value - 0xE0000) >= (uint)(0xE1000 - 0xE0000))
            && (isQuery || rune.Value < 0xF0000);
    }

    return inRange;
}
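
The second clause of the condition uses a common unsigned-subtraction idiom for range checks. As a minimal sketch (the names `RangeCheckSketch`, `IsInRangeNaive`, and `IsInRangeUnsigned` are hypothetical, not from the PR): subtracting the lower bound and comparing as `uint` folds two comparisons into one, because any value below the lower bound wraps around to a very large unsigned number. The sketch assumes default (unchecked) arithmetic and inputs in the Unicode code point range.

```csharp
using System;

// Hypothetical helper for illustration; not code from the PR.
static class RangeCheckSketch
{
    // Straightforward form: two comparisons.
    public static bool IsInRangeNaive(int value, int lowInclusive, int highExclusive)
        => value >= lowInclusive && value < highExclusive;

    // Single-comparison form. If value < lowInclusive, the difference is negative
    // and the cast to uint turns it into a huge number, so the comparison fails.
    public static bool IsInRangeUnsigned(int value, int lowInclusive, int highExclusive)
        => (uint)(value - lowInclusive) < (uint)(highExclusive - lowInclusive);
}
```

The snippet above uses the negated form (`>=` instead of `<`) because it wants code points outside U+E0000..U+E0FFF.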

@MihaZupan
Member Author

MihaZupan commented Feb 10, 2020

Should Rune.TryCreate be preferred over char.IsLowSurrogate and char.ConvertToUtf32? As far as I can tell they are effectively identical.

Regarding the ranges, can you comment on the non-surrogate-pair version of CheckIriUnicodeRange?

@GrabYourPitchforks
Member

> Should Rune.TryCreate be preferred over char.IsLowSurrogate and char.ConvertToUtf32? As far as I can tell they are effectively identical.

The only real difference is that Rune.TryCreate performs the surrogate check once (it's a single method call), while char.IsLowSurrogate followed by char.ConvertToUtf32 performs it twice.
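
To make the comparison concrete, here is a rough side-by-side sketch of the two decoding approaches. The wrapper names (`SurrogateDecodeSketch`, `TryDecodeWithChar`, `TryDecodeWithRune`) are hypothetical; only `char.IsHighSurrogate`, `char.IsLowSurrogate`, `char.ConvertToUtf32`, and `Rune.TryCreate` are real APIs.

```csharp
using System;
using System.Text;

// Hypothetical wrappers for illustration; not code from the PR.
static class SurrogateDecodeSketch
{
    // Explicit check first, then char.ConvertToUtf32, which validates
    // both surrogates again before combining them.
    public static bool TryDecodeWithChar(char highSurr, char lowSurr, out int codePoint)
    {
        if (char.IsHighSurrogate(highSurr) && char.IsLowSurrogate(lowSurr))
        {
            codePoint = char.ConvertToUtf32(highSurr, lowSurr);
            return true;
        }

        codePoint = 0;
        return false;
    }

    // Rune.TryCreate validates and combines the pair in a single call.
    public static bool TryDecodeWithRune(char highSurr, char lowSurr, out int codePoint)
    {
        bool ok = Rune.TryCreate(highSurr, lowSurr, out Rune rune);
        codePoint = ok ? rune.Value : 0;
        return ok;
    }
}
```

Both wrappers should return the same result for any pair of chars; the difference is only in how many times the surrogate ranges are validated.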

@MihaZupan MihaZupan force-pushed the uri-cleanup-checkiriunicoderange branch from a832074 to 7b1917a Compare February 14, 2020 22:08
@MihaZupan
Member Author

I used the range-check optimization that @GrabYourPitchforks suggested; the only change is replacing the cast to ushort with a bitwise AND with 0xFFFF.

I verified that the behaviour for all inputs is the same.
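
One way such a verification could be done is an exhaustive comparison over every supplementary code point. The sketch below is an assumption about the approach, not the check actually used for this PR. `InRangeReference` is a hypothetical reference that spells out the supplementary-plane ranges from my reading of the RFC 3987 ABNF, combined with the isQuery semantics of the snippet earlier in this thread; `InRangeOptimized` uses the AND 0xFFFF variant.

```csharp
using System;

// Hypothetical verification harness for illustration; not code from the PR.
static class IriRangeEquivalenceSketch
{
    // Reference: ranges written out plane by plane.
    public static bool InRangeReference(int cp, bool isQuery)
    {
        int plane = cp >> 16;
        int low = cp & 0xFFFF;
        if (low > 0xFFFD) return false;                 // xxFFFE and xxFFFF are excluded in every plane
        if (plane >= 0x1 && plane <= 0xD) return true;  // 'ucschar': %x10000-1FFFD ... %xD0000-DFFFD
        if (plane == 0xE) return low >= 0x1000;         // 'ucschar': %xE1000-EFFFD
        return isQuery;                                 // planes 0xF and 0x10: 'iprivate' only
    }

    // Optimized form, with AND 0xFFFF in place of the cast to ushort.
    public static bool InRangeOptimized(int cp, bool isQuery)
        => ((cp & 0xFFFF) < 0xFFFE)
            && ((uint)(cp - 0xE0000) >= (uint)(0xE1000 - 0xE0000))
            && (isQuery || cp < 0xF0000);

    // Counts the (code point, isQuery) combinations where the two forms disagree.
    public static int CountMismatches()
    {
        int mismatches = 0;
        for (int cp = 0x10000; cp <= 0x10FFFF; cp++)
        {
            if (InRangeReference(cp, false) != InRangeOptimized(cp, false)) mismatches++;
            if (InRangeReference(cp, true) != InRangeOptimized(cp, true)) mismatches++;
        }
        return mismatches;
    }
}
```

Every valid surrogate pair maps to exactly one code point in 0x10000..0x10FFFF, so looping over code points covers all valid pairs.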

@stephentoub stephentoub reopened this Feb 18, 2020
@MihaZupan MihaZupan merged commit b00f349 into dotnet:master Feb 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020