Skip to content

Pass more information about execution mode to RegexRunners #68242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 21, 2022

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Apr 20, 2022

Some engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do. However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage. Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored. This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work.

Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions.

As part of this, I noticed we have a race condition in the new EnumerateMatches. We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache. That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call. The fix is to extract the data from the Match before returning the runner.

This makes Count, EnumerateMatches, and Replace (when there are no backreferences in the replacement) ~5x faster with the NonBacktracking engine.

Fixes #67980

@ghost
Copy link

ghost commented Apr 20, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Some engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do. However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage. Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored. This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work.

Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions.

As part of this, I noticed we have a race condition in the new EnumerateMatches. We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache. That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call. The fix is to extract the data from the Match before returning the runner.

This makes Count, EnumerateMatches, and Replace (when there are no backreferences in the replacement) ~5x faster with the NonBacktracking engine.

Author: stephentoub
Assignees: stephentoub
Labels:

area-System.Text.RegularExpressions

Milestone: -

@@ -184,7 +185,7 @@ protected internal virtual void Scan(ReadOnlySpan<char> text)
}

runmatch = null;
match.Tidy(runtextpos, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 182 -- if (mode == RegexRunnerMode.Existence) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment. Can you elaborate?

@danmoseley
Copy link
Member

the public doc for quick is merely: "true to search for a match in quick mode; otherwise, false." ...

@danmoseley
Copy link
Member

I see I also mentioned quick mode here
https://github.com/dotnet/runtime/blob/327291967503180b4079720063819da59c3395e3/src/libraries/System.Text.RegularExpressions/src/README.md#L153

We haven't been maintaining this md. What to do with it? I believe we were planning to create something analogous for non backtracking. Perhaps this does need an update, when the code stops changing. Alternatively we could delete it.

@stephentoub
Copy link
Member Author

the public doc for quick is merely: "true to search for a match in quick mode; otherwise, false."

For reference, it's only exposed publicly in a protected API there's no reason for anyone to use, and we should obsolete the overloads as part of #62573.

@stephentoub
Copy link
Member Author

Alternatively we could delete it.

Done

@joperezr
Copy link
Member

We haven't been maintaining this md. What to do with it? I believe we were planning to create something analogous for non backtracking. Perhaps this does need an update, when the code stops changing. Alternatively we could delete it.

First time I ever see that file 😆. If we are only removing it now because it is outdated, but we want to bring it back once we are done with the bulk of our work for .NET 7, I'm happy to take a stab at re-adding those docs with the updated info.

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Some engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do.  However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage.  Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored.  This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work.

Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions.

As part of this, I noticed we have a race condition in the new EnumerateMatches.  We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache.  That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call.  The fix is to extract the data from the Match before returning the runner.
@stephentoub stephentoub merged commit 81d8b31 into dotnet:main Apr 21, 2022
@stephentoub stephentoub deleted the regexmode branch April 21, 2022 15:06
@ghost ghost locked as resolved and limited conversation to collaborators May 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass more info about matching mode to Regex engine
3 participants