Pass more information about execution mode to RegexRunners #68242

stephentoub · 2022-04-20T00:46:06Z

Some engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do. However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage. Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored. This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work.
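
To illustrate the pay-for-play idea, here is a minimal sketch. It is not the code in this PR: `ScanMode`, `ScanResult`, and `DigitRunScanner` are hypothetical names invented for the example, and the "engine" is a toy that finds the first run of digits. It only shows how a mode value lets a scanner stop as soon as the caller's needs are met.

```csharp
using System;

// Hypothetical stand-in for a runner mode; not the runtime's internal enum.
public enum ScanMode
{
    ExistenceOnly, // caller only needs to know whether a match exists
    BoundsOnly,    // caller needs index and length, but no captures
    FullCaptures   // caller needs everything, including captured data
}

// Hypothetical result type for the sketch.
public readonly record struct ScanResult(bool Success, int Index, int Length, char[] Captures);

public static class DigitRunScanner
{
    public static ScanResult Scan(ReadOnlySpan<char> text, ScanMode mode)
    {
        // Phase 1: find where a match starts.
        int start = 0;
        while (start < text.Length && !char.IsDigit(text[start])) start++;
        if (start == text.Length)
            return new ScanResult(false, -1, 0, Array.Empty<char>());

        // Existence is established, so an IsMatch-style caller can stop here.
        if (mode == ScanMode.ExistenceOnly)
            return new ScanResult(true, -1, 0, Array.Empty<char>());

        // Phase 2: find where the match ends.
        int end = start;
        while (end < text.Length && char.IsDigit(text[end])) end++;

        // Count/EnumerateMatches-style callers need only the bounds.
        if (mode == ScanMode.BoundsOnly)
            return new ScanResult(true, start, end - start, Array.Empty<char>());

        // Phase 3: only now pay for materializing the captured data.
        return new ScanResult(true, start, end - start, text.Slice(start, end - start).ToArray());
    }
}
```

In this toy, `Scan("ab123c", ScanMode.BoundsOnly)` reports index 2 and length 3 without allocating anything for captures, while `ScanMode.FullCaptures` additionally copies the matched characters.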

Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions.

As part of this, I noticed we have a race condition in the new EnumerateMatches. We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache. That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call. The fix is to extract the data from the Match before returning the runner.
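
The ordering issue can be sketched as follows. This is a simplified illustration under assumed names, not the actual `Regex` internals: `ToyMatch`, `ToyRunner`, and `ToyRegex` are hypothetical, and the "scan" just looks for a single character. The point is only that the values are copied out of the shared, mutable match object before the runner goes back to the cache.

```csharp
#nullable enable
using System.Threading;

// Hypothetical mutable match object that a runner reuses across scans.
public sealed class ToyMatch
{
    public int Index;
    public int Length;
}

// Hypothetical runner that owns one reusable ToyMatch.
public sealed class ToyRunner
{
    public readonly ToyMatch Match = new ToyMatch();

    public void Scan(string text, char target)
    {
        Match.Index = text.IndexOf(target);
        Match.Length = Match.Index >= 0 ? 1 : 0;
    }
}

public sealed class ToyRegex
{
    // Single-slot cache: whichever thread takes the runner owns it until it is put back.
    private ToyRunner? _cached = new ToyRunner();

    public (int Index, int Length) Find(string text, char target)
    {
        // Rent the cached runner, or create a new one if another thread holds it.
        ToyRunner runner = Interlocked.Exchange(ref _cached, null) ?? new ToyRunner();

        runner.Scan(text, target);

        // Copy everything we need while we still own the runner...
        int index = runner.Match.Index;
        int length = runner.Match.Length;

        // ...and only then return it. Reading runner.Match after this line
        // would race with any thread that rents the runner next.
        _cached = runner;

        return (index, length);
    }
}
```

Swapping the last two steps (returning the runner first, then reading `runner.Match`) reproduces the shape of the bug described above.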

This makes Count, EnumerateMatches, and Replace (when there are no backreferences in the replacement) ~5x faster with the NonBacktracking engine.
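
For reference, these are the public call shapes affected, shown as a small sketch using the .NET 7 APIs. The class name `BoundsOnlyApis`, the pattern, and the input are made up for illustration; no timing is measured here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundsOnlyApis
{
    public static (int Count, int TotalLength, string Replaced) Run(string input)
    {
        var regex = new Regex(@"\d+", RegexOptions.NonBacktracking);

        // Count only needs to know where each match ends, not its captures.
        int count = regex.Count(input);

        // EnumerateMatches exposes just Index and Length through ValueMatch.
        int totalLength = 0;
        foreach (ValueMatch m in regex.EnumerateMatches(input))
        {
            totalLength += m.Length;
        }

        // A replacement string with no backreferences needs only match bounds.
        string replaced = regex.Replace(input, "#");

        return (count, totalLength, replaced);
    }
}
```

None of these three calls surfaces capture groups to the caller, which is why the engine can skip computing them.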

Fixes #67980

ghost · 2022-04-20T00:46:13Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	stephentoub
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

danmoseley · 2022-04-20T01:00:58Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

@@ -184,7 +185,7 @@ protected internal virtual void Scan(ReadOnlySpan<char> text)
                }

                runmatch = null;
-                match.Tidy(runtextpos, 0);


line 182 -- if (mode == RegexRunnerMode.Existence) ?

I don't understand the comment. Can you elaborate?

danmoseley · 2022-04-20T01:02:45Z

the public doc for quick is merely: "true to search for a match in quick mode; otherwise, false." ...

...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

danmoseley · 2022-04-20T01:13:55Z

I see I also mentioned quick mode here
https://github.com/dotnet/runtime/blob/327291967503180b4079720063819da59c3395e3/src/libraries/System.Text.RegularExpressions/src/README.md#L153

We haven't been maintaining this md. What to do with it? I believe we were planning to create something analogous for non backtracking. Perhaps this does need an update, when the code stops changing. Alternatively we could delete it.

stephentoub · 2022-04-20T03:12:59Z

> the public doc for quick is merely: "true to search for a match in quick mode; otherwise, false."

For reference, it's only exposed publicly in a protected API there's no reason for anyone to use, and we should obsolete the overloads as part of #62573.

stephentoub · 2022-04-20T03:22:52Z

> Alternatively we could delete it.

Done

...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs

joperezr · 2022-04-20T21:49:38Z

> We haven't been maintaining this md. What to do with it? I believe we were planning to create something analogous for non backtracking. Perhaps this does need an update, when the code stops changing. Alternatively we could delete it.

First time I've ever seen that file 😆. If we are only removing it now because it is outdated, but we want to bring it back once we are done with the bulk of our work for .NET 7, I'm happy to take a stab at re-adding those docs with the updated info.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Split.cs

...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

stephentoub requested review from joperezr and olsaarik April 20, 2022 00:46

ghost assigned stephentoub Apr 20, 2022

ghost added the area-System.Text.RegularExpressions label Apr 20, 2022

danmoseley reviewed Apr 20, 2022

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs (outdated)

danmoseley reviewed Apr 20, 2022

...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs (outdated)

danmoseley reviewed Apr 20, 2022

....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs (outdated)

danmoseley reviewed Apr 20, 2022

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs (outdated)

stephentoub force-pushed the regexmode branch from d9f5c99 to 130750a on April 20, 2022 11:59

runfoapp bot mentioned this pull request Apr 20, 2022

System.Text.RegularExpressions.Tests.RegexKnownPatternTests.Docs_Examples_ValidateEmail failure #68286 (closed)

joperezr reviewed Apr 20, 2022

...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs

joperezr reviewed Apr 20, 2022

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Split.cs

joperezr reviewed Apr 20, 2022

...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs (outdated)

joperezr reviewed Apr 20, 2022

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs (outdated)

joperezr approved these changes Apr 20, 2022

stephentoub added 3 commits April 20, 2022 20:02

Address PR feedback

8cfba16

Address PR feedback

03ae0f8

stephentoub force-pushed the regexmode branch from 130750a to 03ae0f8 on April 21, 2022 00:06

stephentoub merged commit 81d8b31 into dotnet:main Apr 21, 2022

stephentoub deleted the regexmode branch April 21, 2022 15:06

ghost locked as resolved and limited conversation to collaborators May 21, 2022
