-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Two-phase matching algorithm for NonBacktracking #68199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
First phase now finds the true match end position. The implicit .* is now a lazy .*? to prioritize the earliest match. Third phase is now only run for subcaptures, which no longer needs to find match end position. Remove counter optimization that no longer applies with OrderedOr. Fix a problem in SymbolicRegexInfo where begin/end anchors were marked as line anchors. Also remove dead fields from SymbolicRegexInfo. Fix captures not being handled for empty matches at start of input.
Especially fix comments for the new 2-phase match generation algorithm.
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis PR changes the The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking simulating derivatives that we added recently. Then the second phase is the same as before, walking back to the matching starting point. The third phase is gone for all cases except for patterns with subcaptures. That third phase is a bit simpler than before too, since it doesn't have to find the end of a match (which is already known at that point). Critically, the "dot starred" version of a regex One optimization that we lose is the counter-subsumption one. This was already invalid before, but only hit the unit tests with these changes. The problem is that with the new ordered alternation combining I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:
|
| if (startsWithLineAnchor) | ||
| { | ||
| i |= ContainsLineAnchorMask; | ||
|
|
||
| if (startsWithLineAnchor) | ||
| { | ||
| i |= StartsWithLineAnchorMask; | ||
| } | ||
| i |= StartsWithLineAnchorMask; | ||
| } | ||
|
|
||
| if (startsWithBoundaryAnchor) | ||
| if (startsWithLineAnchor || startsWithSomeAnchor) | ||
| { | ||
| i |= StartsWithBoundaryAnchorMask; | ||
| i |= StartsWithSomeAnchorMask; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
To retain the style of the rest of the checks, this could be:
if (startsWithLineAnchor || startsWithSomeAnchor)
{
i |= StartsWithSomeAnchorMask;
if (startsWithLineAnchor)
{
i |= StartsWithLineAnchorMask
}
}| // Create the dot-star pattern (a concatenation of any* with the original pattern) | ||
| // and all of its initial states. | ||
| _dotStarredPattern = _builder.CreateConcat(_builder._anyStar, _pattern); | ||
| _dotStarredPattern = _builder.CreateConcat(_builder._anyStarLazy, _pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should we rename _dotStarredPattern to _lazyDotStarredPattern?
| // Find the minterm, handling the special case for the last \n for states that start with a relevant anchor | ||
| int mintermId = c == '\n' && i == input.Length - 1 && TStateHandler.StartsWithLineAnchor(ref state) ? | ||
| builder._minterms!.Length : // mintermId = minterms.Length represents \Z (last \n) | ||
| builder._minterms!.Length : // mintermId = minterms.Length represents an \n at the very end of input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
| builder._minterms!.Length : // mintermId = minterms.Length represents an \n at the very end of input | |
| builder._minterms!.Length : // mintermId = minterms.Length represents a \n at the very end of input |
| // that first b until it finds the 6th a: aaaaaaaaaab. | ||
| // that last b until it finds the 4th a: aaabbbc. | ||
| int matchStart; | ||
| if (matchStartLengthMarker >= 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are markers working again? They weren't previously:
#65532
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Show resolved
Hide resolved
| currentState = new CurrentState(nfaState); | ||
| if (i < inputForInnerLoop.Length) | ||
| { | ||
| // We failed to transition. Upgrade to DFA mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // We failed to transition. Upgrade to DFA mode. | |
| // We failed to transition. Upgrade to NFA mode. |
| // If this is an isMatch call we are done, since a match is now known to exist. | ||
| if (isMatch) | ||
| return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
| // If this is an isMatch call we are done, since a match is now known to exist. | |
| if (isMatch) | |
| return 1; | |
| // If this is an IsMatch call, we are done, since a match is now known to exist. | |
| if (isMatch) | |
| return 1; |
| i = pos; | ||
| finalStatePosition = 0; | ||
| return -1; | ||
| Debug.Fail("No nullable state found in the set of end states"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
| Debug.Fail("No nullable state found in the set of end states"); | |
| Debug.Fail("No nullable state found in the set of end states"); |
stephentoub
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I only had nits, so since CI is green, I'm going to merge, and we can address any desired nits subsequently.
* Switch to 2-phase matching in NonBacktracking First phase now finds the true match end position. The implicit .* is now a lazy .*? to prioritize the earliest match. Third phase is now only run for subcaptures, which no longer needs to find match end position. Remove counter optimization that no longer applies with OrderedOr. Fix a problem in SymbolicRegexInfo where begin/end anchors were marked as line anchors. Also remove dead fields from SymbolicRegexInfo. Fix captures not being handled for empty matches at start of input. * Improve comments for NonBacktracking Especially fix comments for the new 2-phase match generation algorithm. * Add a failing test for the earlier NonBacktracking * Avoid transitions to deadends for capuring NFA
|
There were regex related regressions: dotnet/perf-autofiling-issues#4731 As far as I can tell this commit seems to be the cause, but I'm not certain. |
It almost certainly is. @olsaarik, this was necessary for correctness, but did we expect this much of a hit? |
|
Possibly another regression here: dotnet/perf-autofiling-issues#4756 |
|
@stephentoub, the biggest known cause of regressions is that while making this PR I realized that the counter subsumption optimization no longer holds under the backtracking simulating semantics and thus had to remove it. Losing that can for sure have significant impact on some patterns with large counters, which will have a larger DFA state space. Should investigate still though, I did reorganize some of the innermost matching loop logic, but not in a way that I'd expect to cause any measurable regressions. |
|
Thanks, @olsaarik. That's good to know, but I don't think it explains this, as some of the worst regressions here don't involve any counters. |
|
Possible improvement (windows x64): dotnet/perf-autofiling-issues#4960 |
Hmm, it's unlikely to be this. That links shows improvements for both RegexOptions.None and RegexOptions.NonBacktracking, and this PR would have only impacted the latter. |
There is also dotnet/perf-autofiling-issues#4939. Range of diffs for both is 4881a63...eb9dd67 so perhaps #67941 is responsible? |
Those are both NonBacktracking and so could be this PR. It's that Options None one that's confusing. (#67941 isn't used by regex yet.) |
|
More regressions (all non-backtracking):
(note these may be dups, we seem to be double-filing some cases) |




This PR changes the
RegexOptions.NonBacktrackingmatching algorithm to a mostly 2-phase one, fixing #65607. Currently the first phase only walks to the first final state position, relying on a third phase to extend the match as far as possible. Now that NonBacktracking is aiming to fully match the semantics of the backtracking engines, this is problematic for some patterns. For example,.{5}Foo|BaronFooBarFooshould match onooBarFoo, but with the current algorithm it producesBar, the problem being that the first phase finds the first final state at the end ofBarand then fails to walk backwards form that to the starting point of the preferred match.The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking simulating derivatives that we added recently. Then the second phase is the same as before, walking back to the matching starting point. The third phase is gone for all cases except for patterns with subcaptures. That third phase is a bit simpler than before too, since it doesn't have to find the end of a match (which is already known at that point).
Critically, the "dot starred" version of a regex
R, which the first phase operates on is now using a lazy loop:.*?R. With the backtracking simulation this causes a partial matchTto appear on the left side of the core pattern:T|.*?R. IfTever becomes nullable, then any lower priority parts of the derivative will disappear, allowing the matching procedure to reach a deadend state when all earlier-starting match candidates have been exhausted or extended as far as possible.One optimization that we lose is the counter-subsumption one. This was already invalid before, but only hit the unit tests with these changes. The problem is that with the new ordered alternation combining
.{0,1}|.{0,2}into.{0,2}loses the information that the shorter match is preferred. This PR removes the optimization.I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:
SymbolicRegexMatcher.csare now in the order they are called.SymbolicRegexInfodidn't match for which anchors the special handling of an\nat the very end of the input was triggered. This was probably introduced before, but surfaced in the new matching algorithm. Removed unused bits fromSymbolicRegexInfoat the same time.