Regex bug fixes, refactoring (macroization) and code improvments #20918

demerphq · 2023-03-09T10:42:57Z

This PR sequence includes the non-controversial parts of #20677 and #20747.

It includes multiple clean up patches, regex engine bug fixes, and other improvements, including an important patch to simplify and standardize the structures we use to represent regexp opcodes (regops).

It does not include the controversial parts of the above PR's: neither the start_new/end_new pairing, nor the creation of an RXMO object (intended to separate out the compiled form of a regex from the results from that form). It does include many changes which are intended to facilitate the latter, such as wrapping the various regex structure accessors in macros so that they can be used to "hide" the split in the future.

The patch marked "regexec.c - incredibly inefficient solution to backref problem" and its much more efficient successor "regexec.c - make REF into a backtracking state" fixes #10073 - RT72020

The patch marked "regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers" adds tests and todo tests for #18865

The patch marked "regcomp.c - track parens related to CURLYX and CURLYM" includes prep work for tracking how parens are before and within a quantified group, and tests for various bug reports we have related to inconsistencies between CURLYM and CURLYX.

This PR either fixes or contains preparation to fix the following bugs: #20661, #19615, #18865, #14848, #13619, #10073, #8267

I would like to get this merged soon so that i can rebase the other two branches on top. @iabyn and I have been in correspondance about the changes in #20677 and for now I would like to move forward with most, but not all of that PR, this PR excludes the parts which @iabyn found to be controversial. I believe that nothing left in this PR is or should be controversial.

khwilliamson

The two commits that mention fix up should be squashed. I thought I was commenting on individual commits, but things showed up here.

khwilliamson

Typos in commit message

This patch is very crude. It simple pushes the state of the parens and creates
and unwind point
should be 'simply'
and should be 'an'

khwilliamson · 2023-03-13T00:38:24Z

I spent an hour-ish looking at this. The changes to the parts of the engine I'm familiar with look reasonable. I have never spent much time looking at the engine execution stack. Everything looks plausible, so I think this can be merged.

rex->maxlen holds the maximum length the pattern can match, not the minimum. The copy was obviously copied from the rex->minlen case, so fix it to be correct.

…th ampersand Tests with $& always pass in regexp_noamp.t (wrapper around regexp.t), so when they are TODO tests it looks like a TODO pass when in fact it is just an artifact of how we handle ampersand tests in this file. For these cases we simply do not mark them as TODO anymore

This was originally a patch which made somewhat drastic changes to how we represent capture buffers, which Dave M and I and are still discussing offline and which has a larger impact than is acceptable to address at the current time. As such I have reverted the controversial parts of this patch for now, while keeping most of it intact even if in some cases the changes are unused except for debugging purposes. This patch still contains valuable changes, for instance teaching CURLYX and CURLYM about how many parens there are before the curly[1] (which will be useful in follow up patches even if stricly speaking they are not directly used yet), tests and other cleanups. Also this patch is sufficiently large that reverting it out would have a large effect on the patches that were made on top of it. Thus keeping most of this patch while eliminating the controversial parts of it for now seemed the best approach, especially as some of the changes it introduces and the follow up patches based on it are very useful in cleaning up the structures we use to represent regops. [1] Curly is the regexp internals term for quantifiers, named after x{min,max} "curly brace" quantifiers.

In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at the same time. When a branch fails it should reset any capture buffers that might be touched by its branch. We change BRANCH and BRANCHJ to store the number of parens before the branch, and the number of parens after the branch was completed. When a BRANCH operation fails, we clear the buffers it contains before we continue on. It is a bit more complex than it should be because we have BRANCHJ and BRANCH. (One of these days we should merge them together.) This is also made somewhat more complex because TRIE nodes are actually branches, and may need to track capture buffers also, at two levels. The overall TRIE op, and for jump tries especially where we emulate the behavior of branches. So we have to do the same clearing logic if a trie branch fails as well.

Backrefs to unclosed parens inside of a quantified group were not being properly handled, which revealed we are not unrolling the paren state properly on failure and backtracking. Much of the code assumes that when we execute a "conditional" operation (where more than one thing could match) that we need not concern ourself with the paren state unless the conditional operation itself represents a paren, and that generally opcodes only needed to concern themselves with parens to their right. When you exclude backrefs from the equation this is broadly reasonable (i think), as on failure we typically dont care about the state of the paren buffers. They either get reset as we find a new different accepting pathway, or their state is irrelevant if the overal match is rejected (eg it fails). However backreferences are different. Consider the following pattern from the tests "xa=xaaa" =~ /^(xa|=?\1a){2}\z/ in the first iteration through this the first branch matches, and in fact because the \1 is in the second branch it can't match on the first iteration at all. After this $1 = "xa". We then perform the second iteration. "xa" does not match "=xaaa" so we fall to the second branch. The '=?' matches, but sets up a backtracking action to not match if the rest of the pattern does not match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a' in the string, we exit the quantifier loop with $1 = "=xaa" and match \z against the remaining "a" in the pattern, and fail. Here is where things go wrong in the old code, we unwind to the outer loop, but we do not unwind the paren state. We then unwind further into the 2nds iteration of the loop, to the '=?' where we then try to match the tail with the quantifier matching the empty string. We then match the old $1 (which was not unwound) as "=xaa", and then the "a" matches, and we are the end of the string and we have incorrectly accpeted this string as matching the pattern. What should have happend was when the \1 was resolved the second time it should have returned the same string as it did when the =? matched '=', which then would have resulted in the tail matching again, and etc, eventually unwinding the entire pattern when the second iteration failed entirely. This patch is very crude. It simply pushes the state of the parens and creates an unwind point for every case where we do a transition to a B or _next operation, and we make the corresponding _next_fail do the appropriate unwinding. The objective was to achieve correctness and then work towards making it more efficient. We almost certainly overstore items on the stack. In the next patch we will keep track of the unclosed parens before the relevant operators and make sure that they are properly pushed and unwound at the correct times. In other words this is a first step patch to make sure things are correct, the next patch will change it so we do it quickly.

This way we can do the required paren restoration only when it is in use. When we match a REF type node which is potentially a reference to an unclosed paren we push the match context information, currently for "everything", but in a future patch we can teach it to be more efficient by adding a new parameter to the REF regop to track which parens it should save. This converts the backtracking changes from the previous commit, so that it is run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which is by default 0. We don't make the new fields in the struct conditional as the stack frames are large and our changes don't make any real difference and it keeps things simpler to not have conditional members, especially since some of the structures have to line up with each other. If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means something is sensitive to us not necessarily restoring the parens properly on failure. We make some assumptions that the paren state after a failing state will be corrected by a future successful state, or that the state of the parens is irrelevant as we will fail anyway. This can be made not true by EVAL, backrefs, and potentially some other scenarios. Thus I have left this inefficient logic in place but guarded by the flag.

This eliminates the regnode_2L data structure, and merges it with the older regnode_2 data structure. At the same time it makes each "arg" property of the various regnode types that have one be consistently structured as an anonymous union like this: union { U32 arg1u; I32 arg2i; struct { U16 arg1a; U16 arg1b; }; }; We then expose four macros for accessing each slot: ARG1u() ARG1i() and ARG1a() and ARG1b(). Code then explicitly designates which they want. The old logic used ARG() to access an U32 arg1, and ARG1() to access an I32 arg1, which was confusing to say the least. The regnode_2L structure had a U32 arg1, and I32 arg2, and the regnode_2 data strucutre had two I32 args. With the new set of macros we use the regnode_2 for both, and use the appropriate macros to show whether we want to signed or unsigned values. This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit args". However as each slot can also store two U16s, a regnode_3 can hold up to 6 U16s, or as 3 I32's, or a combination. For instance the CURLY style nodes use regnode_3 to store 4 values, ARG1i() for min count, ARG2i() for max count and ARG3a() and ARG3b() for parens before and inside the quantifier. It also changes the functions reganode() to reg1node() and changes reg2Lanode() to reg2node(). The 2L thing was just confusing.

this way we can avoid pushing every buffer, we only need to push the nestroot of the ref.

This insulates access to the regexp match offset data so we can fix the define later and move the offset structure into a new struct. The RXp_OFFSp() was introduced in a recent commit to deliberately break anything using RXp_OFFS() directly. It is hard to type deliberately, nothing but the internals should use it. Everything else should use one of the wrappers around it.

Obviously this isn't required as we build fine. But doing this future proofs us to other changes.

This field will be moving to a new struct. Converting this to a macro will make that move easier.

We were missing various RXp_XXXX() and RX_XXXX() macros. This adds them so we can use them in places where we are unreasonable intimate with the regexp struct internals.

We will move some of these members out of the regexp structure into a new sub structucture. This isolates those changes to the macro definitions

We will move this struct member into a new struct in a future patch, and using the macros means we can reduce the number of places that needs to be explcitly aware of the new structure.

We will move this member to a new struct in the near future, converting all uses to a macro isolates that change.

This member of the regexp structure will be moved to a new structure in the near future. Converting to use the macro will make this change easier to manage.

We will migrate this struct member to a new struct in the near future this change will make that patch more minimal and hide the gory details.

This member of the regexp struct will soon be migrated to a new independent structure. This change ensure that when we do the migration the changes are restricted to the least code possible.

We will migrate this member to a new structure in the near future, wrapping with a macro makes that migration simpler and less invasive.

We will move various members of the regexp structure to a new structure which just contains information about the match. Wrapping the members in the standard macros means that change can be made less invasive. We already did all of this in regexec.c

demerphq · 2023-03-13T09:16:42Z

@khwilliamson fixups applied. Thanks for the review.

iabyn · 2023-03-17T21:09:06Z

On Mon, Mar 13, 2023 at 06:26:20AM -0700, Yves Orton wrote: Merged #20918 into blead.

Sorry, I'm a bit late with a code review. In short, it looks fine, but with the following niggles, listed by commit title. [By fine, I mostly mean the runtime parts - I'm just assuming the regcomp.c parts are ok too because I'm scared of anything in that file.] * regcomp.c - track parens related to CURLYX and CURLYM In Perl_rxres_save(), it changes (twice) Newx()s RX_NPARENS(rx)_1) * 2 to '* 4', but I assume that that is vestigial from the removed start/end changes, and needs to be changed back to '* 2' for now. t/re/re_tests: adds a comment: 'was bar-foo-bar prior to 5.37.7' presumably this should say 5.37.10 ? * regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers IMHO, the new CAPTURE_CLEAR() macro would be better implemented as an inline static function. Easier to debug, single step through etc. * regexec.c - incredibly inefficient solution to backref problem In the curlyx struct part of the union, both cp and lastcp are commented as: /* remember current savestack index */ Presumably it should be only one or the other? (Other structs in the union comment on either cp or lastcp or neither. Its all a bit inconsistent, and doesn't make it clear what the difference is between cp and lastcp fields) * regexec.c - make REF into a backtracking state In regcomp.h, the comment above RE_PESSIMISTIC_PARENS doesn't make it clear enough (to me, anyway, without also reading the commit message) that enabling it *guarantees correct* (albeit very slow) behaviour, while disabling it *optimises away* unnecessary saving of paren state, but might contain bugs. Or to put it another way, its not clear what "aggressive" means in that context. In regcomp_internal.h, this added define has no comment to explain what it is or does: #define VOLATILE_REF 1

…

-- You're only as old as you look.

demerphq · 2023-03-18T12:57:00Z

Sorry, I'm a bit late with a code review. In short, it looks fine, but with the following niggles, listed by commit title. [By fine, I mostly mean the runtime parts - I'm just assuming the regcomp.c parts are ok too because I'm scared of anything in that file.]

See #20940 which addresses the issues you have raised here.

demerphq requested review from khwilliamson, iabyn and tonycoz March 9, 2023 10:43

khwilliamson reviewed Mar 12, 2023

View reviewed changes

khwilliamson reviewed Mar 13, 2023

View reviewed changes

demerphq added 4 commits March 13, 2023 09:30

regexp.h - fixup mistake in comment

2b5d6c6

rex->maxlen holds the maximum length the pattern can match, not the minimum. The copy was obviously copied from the rex->minlen case, so fix it to be correct.

demerphq force-pushed the yves/redo_curlyx_curlym branch from 8cd2da2 to 98a9b34 Compare March 13, 2023 08:31

demerphq added 16 commits March 13, 2023 10:14

regcomp.c - extend REF to hold the paren it needs to regcppush

c41e6a0

this way we can avoid pushing every buffer, we only need to push the nestroot of the ref.

regexp.h - standardize macros, and parenthesize parameters

b82f4ce

Obviously this isn't required as we build fine. But doing this future proofs us to other changes.

regexec.c - use RXp_LASTPAREN(rex) to access rex->lastparen

1065670

This field will be moving to a new struct. Converting this to a macro will make that move easier.

regexp.h - add missing defines

7568ad6

We were missing various RXp_XXXX() and RX_XXXX() macros. This adds them so we can use them in places where we are unreasonable intimate with the regexp struct internals.

dump.c - use RXp_ macros to access regexp struct members

4a94860

We will move some of these members out of the regexp structure into a new sub structucture. This isolates those changes to the macro definitions

regexec.c - use RXp_LASTCLOSEPAREN(r) to access r->lastcloseparen

e8a5523

We will move this struct member into a new struct in a future patch, and using the macros means we can reduce the number of places that needs to be explcitly aware of the new structure.

regexec.c - use macro to access rex->subbeg

3697eb9

We will move this member to a new struct in the near future, converting all uses to a macro isolates that change.

regexec.c - use RXp_SUBLEN(ret) for ret->sublen

0bce65a

This member of the regexp structure will be moved to a new structure in the near future. Converting to use the macro will make this change easier to manage.

regexec.c - use RXp_SUBOFFSET(rx) instead of rx->suboffset

3a05036

We will migrate this struct member to a new struct in the near future this change will make that patch more minimal and hide the gory details.

regexec.c - use RXp_SUBCOFFSET instead of rx->subcoffset

eb05738

This member of the regexp struct will soon be migrated to a new independent structure. This change ensure that when we do the migration the changes are restricted to the least code possible.

regexec.c - use RXp_SAVED_COPY(rex) instead of rex->saved_copy

04d422e

We will migrate this member to a new structure in the near future, wrapping with a macro makes that migration simpler and less invasive.

demerphq force-pushed the yves/redo_curlyx_curlym branch from 98a9b34 to e9a6ab1 Compare March 13, 2023 09:16

demerphq merged commit 6b6b368 into blead Mar 13, 2023

demerphq deleted the yves/redo_curlyx_curlym branch March 13, 2023 13:26

demerphq mentioned this pull request Mar 18, 2023

Various regex engine minor changes for @iabyn #20940

Merged

richardleach mentioned this pull request Feb 8, 2025

Ancient Regex Regression #16616

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Regex bug fixes, refactoring (macroization) and code improvments #20918

Regex bug fixes, refactoring (macroization) and code improvments #20918

Uh oh!

demerphq commented Mar 9, 2023

Uh oh!

khwilliamson left a comment •

edited

Loading

Uh oh!

khwilliamson left a comment

Uh oh!

khwilliamson commented Mar 13, 2023

Uh oh!

demerphq commented Mar 13, 2023

Uh oh!

iabyn commented Mar 17, 2023 via email

Uh oh!

demerphq commented Mar 18, 2023

Uh oh!

Uh oh!

Regex bug fixes, refactoring (macroization) and code improvments #20918

Regex bug fixes, refactoring (macroization) and code improvments #20918

Uh oh!

Conversation

demerphq commented Mar 9, 2023

Uh oh!

khwilliamson left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

khwilliamson left a comment

Choose a reason for hiding this comment

Uh oh!

khwilliamson commented Mar 13, 2023

Uh oh!

demerphq commented Mar 13, 2023

Uh oh!

iabyn commented Mar 17, 2023 via email

Uh oh!

demerphq commented Mar 18, 2023

Uh oh!

Uh oh!

khwilliamson left a comment •

edited

Loading