Skip to content

Conversation

@stephentoub
Copy link
Member

@stephentoub stephentoub commented Apr 20, 2020

Profiling shows that RegexOptions.Compiled | RegexOptions.IgnoreCase regexes are spending a non-trivial amount of time invoking char.ToLower(c, _culture), and a good chunk of this is just in making the virtual call to _culture.TextInfo inside of char.ToLower. Instead of caching a CultureInfo and calling char.ToLower(c, _culture), RegexCompiler can cache the TextInfo once and then call _textInfo.ToLower(c), saving a virtual call on every character comparison.

That's the second commit. The first commit addresses why this change hasn't been done until now. We'd previously made the same change in RegexInterpreter, but in RegexCompiler, we need a temporary in order to swap parameters to make the ToLower lower call (by the time CallToLower is invoked, the char is already on the evaluation stack). So, the first commit fixes how RegexCompiler manages temporary locals. Previously, the developer would declare a few temporary locals at the beginning of the method, and then manually figure out how to use them in a safe manner. With this change, we instead maintain a pool of locals that can be rented from just where they're needed. That makes it possible for a helper like CallToLower to just grab a local for the brief period it needs one.

This PR handles the RegexOptions.IgnoreCase scenario. #35194 covers the RegexOptions.IgnoreCase | RegexOptions.InvariantCulture scenario, as there RegexCompiler uses ToLowerInvariant.

Review without whitespace diffs: https://github.com/dotnet/runtime/pull/35203/files?w=1
cc: @GrabYourPitchforks, @eerhardt, @pgovind

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private readonly Regex _regex = new Regex("Hello, \\w+.", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    private readonly string _input = new string("This is a test to see how well this does.  Hello, DotNetBot.");

    [Benchmark] public bool Test() => _regex.IsMatch(_input);
}
Method Job Toolchain Mean Error StdDev Ratio
Test Job-ABLHBF \master\corerun.exe 207.0 ns 0.42 ns 0.37 ns 1.00
Test Job-VWJGVI \pr\corerun.exe 149.9 ns 0.33 ns 0.25 ns 0.72

Currently GenerateGo/FindFirstChar create a few _tempLocal1, _tempLocal2, etc. LocalBuilders that various parts of the methods may use.  The developer then need to keep track of which of these locals is currently being used such that we maximize the re-use of them while not accidentally overlapping their use.  And the names end up being non-descriptive, so in some cases additional names are added for convenience.

The primary purpose of the PR is cleanup.  It adds a pool of LocalBuilders that lets them be rented just where they're needed, such that we only create locals when necessary, can give them appropriate names, etc.  However, it also makes it a lot easier for helpers to create temporary locals (e.g. as was done in EmitMatchCharacterClass) which then makes it easier to add additional optimizations.
To support RegexOptions.IgnoreCase, RegexCompiler ends up generating a call to `char.ToLower(c, _culture)` for every character compared.  That `ToLower` call in turn needs to access `_culture.TextInfo`, and then call its `ToLower(char)` virtual method.  We can instead get the `TextInfo` once and call its `ToLower` method directly, saving a virtual call on every character being compared.

Note that we were already doing this (for .NET 5) in RegexInterpreter, but hadn't done it in RegexCompiler because of complications around locals management.  The previous commit fixed that and made this possible.
@stephentoub stephentoub added this to the 5.0 milestone Apr 20, 2020
@ghost
Copy link

ghost commented Apr 20, 2020

Tagging subscribers to this area: @eerhardt
Notify danmosemsft if you want to be subscribed.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Apr 20, 2020

@stephentoub We recently changed TextInfo.ToUpper(char) / ToLower(char) to be non-virtual. Is this something you could take advantage of in your code by emitting call instead of callvirt? Or do you need to retain compatibility with downlevel platforms?

(Edit: I suppose you might be relying on the implicit null check that callvirt performs.)

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks super clean. I like how this turned out vs the 3 _tempXLocal variables.

_int32LocalsPool ??= new Stack<LocalBuilder>(),
_int32LocalsPool.TryPop(out LocalBuilder? iterationLocal) ? iterationLocal : DeclareInt32());

/// <summary>Rents an ReadOnlySpan(char) local variable slot from the pool of locals.</summary>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(super nit) Rents an ReadOnlySpan => Rents a ReadOnlySpan

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll fix the comment in the other PR, to avoid restarting CI.

@stephentoub
Copy link
Member Author

We recently changed TextInfo.ToUpper(char) / ToLower(char) to be non-virtual. Is this something you could take advantage of in your code by emitting call instead of callvirt? Or do you need to retain compatibility with downlevel platforms? (Edit: I suppose you might be relying on the implicit null check that callvirt performs.)

It'll still invoke it non-virtually, and potentially even inline it; it'll just include the null check if needed. You think it's worthwhile changing it to be call instead? Last I knew, in general the C# compiler emits callvirt even for non-virtual methods, except in some circumstances, such as with the ?. operator.

If we want to change it, I think we should change other uses of callvirt as well, e.g. various string methods are currently invoked with callvirt instead of call:

@stephentoub stephentoub merged commit ab4a117 into dotnet:master Apr 21, 2020
@stephentoub stephentoub deleted the poolregexcompiler branch April 21, 2020 01:09
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants