-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Pass more information about execution mode to RegexRunners #68242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsSome engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do. However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage. Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored. This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work. Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions. As part of this, I noticed we have a race condition in the new EnumerateMatches. We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache. That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call. The fix is to extract the data from the Match before returning the runner. This makes Count, EnumerateMatches, and Replace (when there are no backreferences in the replacement) ~5x faster with the NonBacktracking engine.
|
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Outdated
Show resolved
Hide resolved
@@ -184,7 +185,7 @@ protected internal virtual void Scan(ReadOnlySpan<char> text) | |||
} | |||
|
|||
runmatch = null; | |||
match.Tidy(runtextpos, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 182 -- if (mode == RegexRunnerMode.Existence)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment. Can you elaborate?
the public doc for quick is merely: "true to search for a match in quick mode; otherwise, false." ... |
...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs
Outdated
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Outdated
Show resolved
Hide resolved
I see I also mentioned quick mode here We haven't been maintaining this md. What to do with it? I believe we were planning to create something analogous for non backtracking. Perhaps this does need an update, when the code stops changing. Alternatively we could delete it. |
For reference, it's only exposed publicly in a protected API there's no reason for anyone to use, and we should obsolete the overloads as part of #62573. |
Done |
...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs
Show resolved
Hide resolved
First time I ever see that file 😆. If we are only removing it now because it is outdated, but we want to bring it back once we are done with the bulk of our work for .NET 7, I'm happy to take a stab at re-adding those docs with the updated info. |
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Split.cs
Show resolved
Hide resolved
...braries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunnerMode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do. However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage. Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored. This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work. Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions. As part of this, I noticed we have a race condition in the new EnumerateMatches. We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache. That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call. The fix is to extract the data from the Match before returning the runner.
Some engines, in particular NonBacktracking, are relatively pay-for-play, in that the more information you need, the more processing they do. However, we currently don't pass enough information down to the RegexRunner to allow the engine to take full advantage. Today NonBacktracking can short-circuit its evaluation if it's told IsMatch is being used, but with the new EnumerateMatches and Count (and some uses of Replace), it's still gathering up all of the capture information even though that capture information will be ignored. This commit introduces a new RegexRunnerMode enum that lets us pass down to the engine exactly what portion of the information is needed, allowing it to avoid unnecessary work.
Related, we can reduce the amount of work performed by Match.Tidy: if the captures information won't be used, there's no point in fixing up the positions.
As part of this, I noticed we have a race condition in the new EnumerateMatches. We want to extract the index, length, and new text position from the Match object in order to populate the enumerator and result structs, but today we're doing so after the runner is returned to the cache. That means another thread could come along and start using that same Match object while we're still using it in the EnumerateMatches call. The fix is to extract the data from the Match before returning the runner.
This makes Count, EnumerateMatches, and Replace (when there are no backreferences in the replacement) ~5x faster with the NonBacktracking engine.
Fixes #67980