-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Fix balancing groups incorrectly removed from negative lookarounds #120856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a bug where balancing groups inside negative lookarounds were incorrectly removed during regex optimization, causing patterns like ()(?'-1')(?!(?'-1')) to fail to match when they should. The fix preserves balancing groups while still removing regular capture groups from negative lookarounds.
- Modified
RemoveCapturesmethod to check if a capture node is a balancing group before removing it - Added test cases to verify correct matching behavior for balancing groups in negative lookarounds
- All existing tests continue to pass, confirming the optimization for regular captures still works
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs |
Added check to preserve balancing groups (N != -1) while removing regular captures (N == -1) from negative lookarounds |
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs |
Added test case verifying single match behavior with balancing groups in negative lookarounds |
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Count.Tests.cs |
Added test case verifying correct match count for pattern with balancing groups in negative lookarounds |
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
tarekgh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
/backport to release/10.0-staging |
|
Started backporting to release/10.0-staging: https://github.com/dotnet/runtime/actions/runs/18609604713 |
|
@stephentoub an error occurred while backporting to "release/10.0-staging", please check the run log for details! Error: The specified backport target branch "release/10.0-staging" wasn't found in the repo. |
i believe there's no release/10.0-staging yet? |
That's what the message above says, yes. |
|
/backport to release/10.0 |
|
Started backporting to release/10.0: https://github.com/dotnet/runtime/actions/runs/18639594553 |
… lookarounds (#120888) Backport of #120856 to release/10.0 /cc @stephentoub @Copilot ## Customer Impact - [x] Customer reported - [ ] Found internally Balancing groups inside of negative lookarounds get removed erroneously, erroneously changing the behavior of those (rare but possible) regexes. ## Regression - [x] Yes - [ ] No Last .NET 10 preview. ## Testing Additional unit tests added. ## Risk Low. The fix is adding an extra condition to the if guarding an optimization; it simply makes the optimization apply less often. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
…120983) Closes #117605 ## Summary This PR adds comprehensive regression tests to verify that captures made within positive lookbehind assertions are preserved correctly across all regex engines (Interpreter, Compiled, and SourceGenerated). ## Background Issue #117605 reported that in Compiled mode, the .NET regex engine was incorrectly removing captures made by capturing groups within positive lookbehind assertions, while the Interpreted mode preserved them correctly. For example: ```csharp string pattern = "(?<=(abc)+?123)a"; string input = "abcabc123a"; var compiled = new Regex(pattern, RegexOptions.Compiled); var interpreted = new Regex(pattern, RegexOptions.None); var matchCompiled = compiled.Match(input); var matchInterpreted = interpreted.Match(input); // Expected: Both should have 1 capture in Group[1] // Bug: Compiled showed 0 captures, Interpreted showed 1 capture ``` ## Changes Upon investigation, I found that **the issue has already been fixed** in the current codebase. Both Compiled and Interpreted modes now produce identical results for all test cases mentioned in the issue. However, the existing tests in `Match_MemberData_Cases` only verified match success and matched values, not the actual capture counts and values. This PR adds comprehensive tests to `Match_Advanced_TestData` that specifically verify: - Capture counts match across all engines - Capture values, indices, and lengths are correct - Various quantifier patterns work correctly (`+?`, `{2,}?`, etc.) - Multiple capturing groups in lookbehinds are handled properly ## Tests Added 1. **Main issue case**: `(?<=(abc)+?123)a` - Verifies lazy quantifier in lookbehind 2. **Multiple captures**: `(?<=(abc){2,}?123)a` - Verifies bounded quantifier with 2 captures 3. **Simple lookbehind**: `(?<=(abc)+?)a` - Verifies basic lazy quantifier behavior 4. **Multiple groups**: `(?<=(?'1'abc)+?(?'2')123)a` - Verifies multiple named groups in lookbehind All tests properly handle the NonBacktracking engine (which doesn't support lookbehinds) and run against all other available engines. ## Test Results All 30,335 regex tests pass successfully with no regressions. <!-- START COPILOT CODING AGENT SUFFIX --> <details> <summary>Original prompt</summary> > > ---- > > *This section details on the original issue you should resolve* > > <issue_title>Compiled regex incorrectly removes the captures which captured in positive lookbehind assertions</issue_title> > <issue_description>### Description > > I saw u added a test `yield return (@"(?<=(abc)+?123)", "abcabc123", RegexOptions.None, 0, 9, true, "");` on line 135 in [file Regex.Match.Tests.cs](https://github.com/dotnet/runtime/pull/117629/files#diff-2bdc6a6839e8ce7675737db0067c7cf288d190e0af1559feed81d5be92c9b216). > Based on this, I found a new bug: > In Compiled mode, when leaving positive lookbehind assertions, the .NET regex engine will remove all the captures which captured by the last direct group construction(the last direct child node). > > > -------------------------------- > By the way, I found that when leaving negative assertions, the .NET regex engine would not delete captures which captured in the assertion on .NET8, but this bug fixed on .NET9. > For example: > ```c# > using System; > using System.Text.RegularExpressions; > > Console.WriteLine(new Regex("(?<!x())a",RegexOptions.Compiled).Match("a").Groups[1].Captures.Count); > Console.WriteLine(new Regex("(?!()x)a",RegexOptions.Compiled).Match("a").Groups[1].Captures.Count); > ``` > Output on .NET8: > ``` > 1 > 1 > ``` > Maybe they are related? > > ### Reproduction Steps > > ```c# > using System; > using System.Text.RegularExpressions; > > string input = "abcabc123a"; > string pattern="(?<=(abc)+?123)a"; > //pattern="(?'1')(?<=(?'1'abc)+?(?#<=just delete the left capture)(?'1'abc)+?123)a"; > //pattern="(?'1')(?<=(?'1'abc)+?(?'2')123)a";//the capture of group#1 was removed, but the capture of group#2 wasn't > //pattern="(?<=(abc){2,}?123)a";//two captures of group#1 which captured in the positive lookbehind assertions(?<=) are both deleted > > //pattern="(?<=(abc){2}?123)a";//not deleted > //pattern=@"(?<=\1?(abc){2,}?123)a";//not deleted > //pattern="(?<=(?(1))(abc){2,}?123)a";//not deleted > //pattern="(?<=(?'-1')?(abc){2,}?123)a";//not deleted > > //pattern="1(?=x(abc)+?)";input="1xabc";//not deleted in positinve lookahead assertions > //pattern="1(?=(abc)+?x)";input="1abcx";//not deleted in positinve lookahead assertions > > try > { > Console.WriteLine("----------Compiled----------"); > Match matchCompiled = new Regex(pattern, RegexOptions.Compiled).Match(input); > Console.WriteLine(matchCompiled.Value); > Console.WriteLine(matchCompiled.Groups[1].Captures.Count); > Console.WriteLine("Groups#1 Captures:"); > foreach(Capture c in matchCompiled.Groups[1].Captures) > Console.WriteLine($"{c.Value},{c.Index},{c.Length}"); > > Console.WriteLine("----------Interpreted----------"); > Match Interpreted= new Regex(pattern, RegexOptions.None).Match(input); > Console.WriteLine(Interpreted.Value); > Console.WriteLine(Interpreted.Groups[1].Captures.Count); > Console.WriteLine("Groups#1 Captures:"); > foreach(Capture c in Interpreted.Groups[1].Captures) > Console.WriteLine($"{c.Value},{c.Index},{c.Length}"); > > Console.WriteLine("----------result comparion----------"); > Console.WriteLine($"Result equal:{matchCompiled.Groups[1].Captures.Count == Interpreted.Groups[1].Captures.Count}"); > } > catch(Exception ex) > { > Console.WriteLine(ex.Message); > } > ``` > > ### Expected behavior > > Output: > ``` > ----------Compiled---------- > a > 1 > Groups#1 Captures: > abc,3,3 > ----------Interpreted---------- > a > 1 > Groups#1 Captures: > abc,3,3 > ----------result comparion---------- > Result equal:True > ``` > > ### Actual behavior > > Output: > ``` > ----------Compiled---------- > a > 0 > Groups#1 Captures: > ----------Interpreted---------- > a > 1 > Groups#1 Captures: > abc,3,3 > ----------result comparion---------- > Result equal:False > ``` > > ### Regression? > > _No response_ > > ### Known Workarounds > > _No response_ > > ### Configuration > > _No response_ > > ### Other information > > _No response_</issue_description> > > <agent_instructions>Please check whether this issue still repros. If it does, please fix it and add a test. If it doesn't, please just ensure there's one or more tests covering the cited issues.</agent_instructions> > > ## Comments on the Issue (you are @copilot in this section) > > <comments> > <comment_new><author>@stephentoub</author><body> > > Did it fixed by #120856 yet? > > No, that issue is unrelated. > > That label issue still exists in the source generator. It's adding braces in a way that's preventing the goto from seeing the target label, which exists, just not in scope. I'd need to debug through to see why the braces are being added.</body></comment_new> > </comments> > </details> Fixes #117651 <!-- START COPILOT CODING AGENT TIPS --> --- 💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Summary
Fixed a bug where balancing groups were incorrectly being removed from negative lookarounds during regex tree reduction, causing patterns like
()(?'-1')(?!(?'-1'))to incorrectly return 0 matches instead of matching at every position.Root Cause
The issue was introduced by an optimization in the
RemoveCapturesmethod withinReduceLookaroundinRegexNode.cs. This method removes capture groups from negative lookarounds since captures inside negative lookarounds are undone after the lookaround completes. However, it was incorrectly removing ALL capture groups, including balancing groups.Balancing groups (e.g.,
(?'-1')) have semantic meaning that affects matching behavior - they require a specific capture group to have been captured before they can succeed. Removing them changes the match semantics.The Fix
Modified the
RemoveCapturesmethod to preserve balancing groups by checking ifN != -1(where N stores the uncapture group number). The code now uses pattern matching for cleaner syntax:if (node is { Kind: RegexNodeKind.Capture, N: -1 }). Updated comments to clarify that captures that don't rely on or impact persisted state can be removed, which includes backreferences and balancing groups.Changes Made
RemoveCapturesto checknode.N == -1before removing captures, using pattern matchingTest Results
()(?'-1')(?!(?'-1'))now correctly matches at every position in the input stringOriginal prompt
Fixes #120849
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.