Skip to content

Remove implicit anchoring optimization from Regex #42408

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 19, 2020

Conversation

stephentoub
Copy link
Member

In .NET 5 we added a bunch of optimizations to Regex. One of them was a transform that optimized for the case where the pattern begins with .*. If it does, then we insert an implicit anchor at the beginning in order to avoid unnecessary backtracking. Imagine the pattern .*a and the pattern bcdefghijklmnopqrstuvwxyz. This is going to start matching at b, find the next newline, and then backtrack from there looking for the a; it won't find it and will backtrack all the way, failing the match at that position. At that point it'll bump to the next position, starting at c, and do it all over. It'll fail, backtrack all the way, and bump again, starting at d, and doing it all over. Etc. The optimization recognizes that since . will match anything other than newline, after it fails to match at the first position, we can just skip all subsequent positions until the next newline, as they're all going to fail.

However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text. In that case, the implicitly added anchor will fail the match in the actual "Go" matching logic.

There are safe ways to do this optimization, e.g. introducing a variant of these anchors that let FindFirstChar skip ahead but that aren't considered for Go's matching purposes, but we can look at employing those for .NET 6. For now for .NET 5, this commit just deletes the faulty optimization and adds a few tests that were failing it.

#42390
#42392

cc: @pgovind, @eerhardt, @danmosemsft

@ghost
Copy link

ghost commented Sep 17, 2020

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Member Author

/backport to release/5.0-rc2

@github-actions
Copy link
Contributor

Started backporting to release/5.0-rc2: https://github.com/dotnet/runtime/actions/runs/260042103

@danmoseley danmoseley requested a review from pgovind September 17, 2020 20:50
Copy link

@pgovind pgovind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the unit test nit that Dan pointed out, otherwise the change LGTM!

In .NET 5 we added a bunch of optimizations to Regex.  One of them was a transform that optimized for the case where the pattern begins with `.*`.  If it does, then we insert an implicit anchor at the beginning in order to avoid unnecessary backtracking.  Imagine the pattern `.*a` and the pattern `bcdefghijklmnopqrstuvwxyz`.  This is going to start matching at `b`, find the next newline, and then backtrack from there looking for the `a`; it won't find it and will backtrack all the way, failing the match at that position.  At that point it'll bump to the next position, starting at `c`, and do it all over.  It'll fail, backtrack all the way, and bump again, starting at `d`, and doing it all over.  Etc.  The optimization recognizes that since `.` will match anything other than newline, after it fails to match at the first position, we can just skip all subsequent positions until the next newline, as they're all going to fail.

However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text.  In that case, the implicitly added anchor will fail the match in the actual "Go" matching logic.

There are safe ways to do this optimization, e.g. introducing a variant of these anchors that let FindFirstChar skip ahead but that aren't considered for Go's matching purposes, but we can look at employing those for .NET 6.  For now for .NET 5, this commit just deletes the faulty optimization and adds a few tests that were failing it.
@jeffhandley
Copy link
Member

Remaining check timeouts are unrelated.

@jeffhandley jeffhandley merged commit a3f643b into dotnet:master Sep 19, 2020
@stephentoub stephentoub deleted the fixregexopt branch October 12, 2020 00:28
@stephentoub stephentoub restored the fixregexopt branch October 12, 2020 00:28
@stephentoub stephentoub deleted the fixregexopt branch October 12, 2020 00:28
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants