-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Remove implicit anchoring optimization from Regex #42408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley |
/backport to release/5.0-rc2 |
Started backporting to release/5.0-rc2: https://github.com/dotnet/runtime/actions/runs/260042103 |
src/libraries/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the unit test nit that Dan pointed out, otherwise the change LGTM!
In .NET 5 we added a bunch of optimizations to Regex. One of them was a transform that optimized for the case where the pattern begins with `.*`. If it does, then we insert an implicit anchor at the beginning in order to avoid unnecessary backtracking. Imagine the pattern `.*a` and the pattern `bcdefghijklmnopqrstuvwxyz`. This is going to start matching at `b`, find the next newline, and then backtrack from there looking for the `a`; it won't find it and will backtrack all the way, failing the match at that position. At that point it'll bump to the next position, starting at `c`, and do it all over. It'll fail, backtrack all the way, and bump again, starting at `d`, and doing it all over. Etc. The optimization recognizes that since `.` will match anything other than newline, after it fails to match at the first position, we can just skip all subsequent positions until the next newline, as they're all going to fail. However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text. In that case, the implicitly added anchor will fail the match in the actual "Go" matching logic. There are safe ways to do this optimization, e.g. introducing a variant of these anchors that let FindFirstChar skip ahead but that aren't considered for Go's matching purposes, but we can look at employing those for .NET 6. For now for .NET 5, this commit just deletes the faulty optimization and adds a few tests that were failing it.
146e300
to
7c1a0da
Compare
Remaining check timeouts are unrelated. |
In .NET 5 we added a bunch of optimizations to Regex. One of them was a transform that optimized for the case where the pattern begins with
.*
. If it does, then we insert an implicit anchor at the beginning in order to avoid unnecessary backtracking. Imagine the pattern.*a
and the patternbcdefghijklmnopqrstuvwxyz
. This is going to start matching atb
, find the next newline, and then backtrack from there looking for thea
; it won't find it and will backtrack all the way, failing the match at that position. At that point it'll bump to the next position, starting atc
, and do it all over. It'll fail, backtrack all the way, and bump again, starting atd
, and doing it all over. Etc. The optimization recognizes that since.
will match anything other than newline, after it fails to match at the first position, we can just skip all subsequent positions until the next newline, as they're all going to fail.However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text. In that case, the implicitly added anchor will fail the match in the actual "Go" matching logic.
There are safe ways to do this optimization, e.g. introducing a variant of these anchors that let FindFirstChar skip ahead but that aren't considered for Go's matching purposes, but we can look at employing those for .NET 6. For now for .NET 5, this commit just deletes the faulty optimization and adds a few tests that were failing it.
#42390
#42392
cc: @pgovind, @eerhardt, @danmosemsft