Skip to content

fixup many regexp bugs - not to be merged yet #20677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: blead
Choose a base branch
from

Conversation

demerphq
Copy link
Collaborator

@demerphq demerphq commented Jan 6, 2023

This fixes a number of longstanding regexp bugs and issues. It includes debugging changes and other stuff i will rebase away.

The main point is that we use a two layer model to represent capture buffers, and implement a concept of "commiting" the changes, and we store additional data in the CURLY nodes so we can manage the parens they contain properly. This is likely not the most efficient way to do this. The idea was to be correct first, then optimize.

The intended semantics are that: when we have a quantified group we keep track of the parens that it contains, and we ensure that all capture buffers it does contain come from a single iteration of the quantifier.

Consider

./perl -le'"barfoobar"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&";'
bar--bar=barfoobar

In older perls:

perl -le'"barfoobar"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&";'
bar-foo-bar=barfoobar

You can see that in older perls that $2 is from the second iteration, and $3 is from the third.

This patch currently doesn't entirely fix this (yet):

./perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&";'
a|a--|a-|a=abcaba

where arguably $2 and $5 should be empty, as they werent part of the last succesful match. Currently I have only fixed CURLYX -> CURLYM. And what should happen here is a bit controversial, you could argue this is correct.

However even for this case we have a change from older perls:

perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&";'
a|a-b-c|a-b|a=abcaba

which really highlights the underlying bug, we have capture buffers here from the first, second and third iteration of the +. IMO that doesn't make sense, and it is inconsistent with the behavior of capture buffers in other contexts.

This PR patch also fixes some other related issues, for instance the min and max have been changed and REG_INFINITY is now I32_MAX, as opposed to U16_MAX. This means that CURLYM and CURLYX now behave the same.

This is intended to fix #20661 and related bugs.

Fixes #20661
Fixes #8267
Fixes #19615
Fixes #18865
Fixes #14848
Fixes #13619
Fixes #10073

@demerphq demerphq requested a review from khwilliamson January 6, 2023 16:42
@demerphq
Copy link
Collaborator Author

demerphq commented Jan 6, 2023

#20661

@demerphq demerphq requested review from hvds and iabyn January 6, 2023 16:45
@demerphq demerphq force-pushed the yves/curlyx_curlym branch from a5a79a2 to e169495 Compare January 6, 2023 16:52
@demerphq demerphq requested a review from tonycoz January 6, 2023 17:09
@demerphq demerphq force-pushed the yves/curlyx_curlym branch 2 times, most recently from 1d6325f to f3d56ea Compare January 7, 2023 17:23
@demerphq
Copy link
Collaborator Author

demerphq commented Jan 7, 2023

This now works as I believe is expected, and matches PCRE etc. I am pretty sure it is nothing as elegant and efficient as it could be using the regexp state machine, but I do believe it is now correct. You have to start somewhere. :-)

The important gist of the changes are:

The offset data has been doubled to hold a "start/end" and "start_new/end_new" pairs. OPEN/CLOSE etc update start_new/end_new. All of the buffer read logic has been changed to check both, first the "_new" variants, then the standard ones. At the and of a successful iteration of the quantifier we "commit" the state of the _new data to the standard data. Prior to a new iteration we clear the _new data. All CURLY style operators now store the min/max as I32's and they store the "first paren" and "last paren" of their contents.

All branches now store the number of parens before them, and contained by them. As do tries. After a failed branch we clear the _new data for the parens it contains. (As well as whatever else we did historically).

There are a few other minor changes embedded into these patches which i will tease out, but that is the main point. @iabyn if you have suggestions on how to make it more efficient I would love to hear them.

@jkeenan jkeenan added the do not merge Don't merge this PR, at least for now label Jan 7, 2023
@demerphq
Copy link
Collaborator Author

demerphq commented Jan 7, 2023

@jkeenan why did you add a "do not merge" to this /draft/ PR? Doesn't that kinda defeat the point of it being draft?

@jkeenan
Copy link
Contributor

jkeenan commented Jan 7, 2023 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Jan 7, 2023

@kid51 the point is that is a note to me, before I change it from draft ill probably split some of those patches apart, and I will certainly remove a bunch of debug crap. I wont change it from draft until I have done so.I thought do-not-merge is for ready PR's that we are sitting on till a later release cycle. Anyway, it doesn't matter, it just struck me as redundant.

@demerphq demerphq force-pushed the yves/curlyx_curlym branch 2 times, most recently from 3a5f7a8 to fc396c8 Compare January 8, 2023 16:10
@tonycoz
Copy link
Contributor

tonycoz commented Jan 9, 2023

I've had a look over it, but I'm not especially familiar with the regexp engine.

@demerphq
Copy link
Collaborator Author

demerphq commented Jan 9, 2023

I've had a look over it, but I'm not especially familiar with the regexp engine.

Your feedback is already super useful, thanks. I need to think on what to do about (??{{ }}) and friends. I had hoped this trick with the extra block would Just Work. Now I need to think about this. How important is the local thing? If someone is using (?{{}}) maybe they dont want to use local. I have a feeling this is hopeless optimism and I just need to figure out an alternative. Which means messing with the toker, which I had really wanted to avoid. That code is gnarly.

@iabyn
Copy link
Contributor

iabyn commented Jan 9, 2023 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Jan 9, 2023

@iabyn - fair enough. I thought I had explained well enough in the PR documentation. I didnt bother in the commits because I am fairly certain I am not doing things efficiently and I would like your advice on how to do them efficiently so i expect to rip them up after I get some of your advice. What I have written in the text to this PR is more relevant generally than the code. But clearly I have not explained myself sufficiently.

I am trying to solve the issues of clearing capture buffers properly, and setting them properly inside of quantified expressions, especially I am trying to resolve the issues that come from CURLYX and CURLYN and CURLYM not being equivalent. The code says that converting CURLYX to CURLYM or CURLYN is a "powerful optimization", which is correct in a sense, but it is not purely an optimization as it turns out CURLYM and CURLN do things "right" as far as capture buffers go, CURLYX does not. Which is problematic, as we disable these optimizations under various circumstances. Meaning a minor change to the pattern can result in the state of the capture buffers afterwards being different.

Essentially there are two core bugs here. They both are solved using the same underlying machinery, teaching the relevant regops to know how many parens are bfore them, and how many are defined within them, a macro called CAPTURE_CLEAR() and a macro called CAPTURE_COMMIT(), and the addition of "transactional transparent capture buffers" which are committed only when a quantifier iteration is successful.

BUG 1

perl -le'"foobarfoo"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&"' #wrong
foo-foo-bar=foobarfoo 

This should not leave $2 AND $3 set at the same time. Either $3 should be undef, and $1 should equal $2. Or, $2 should be undef and $1 should equal $3. It should work like the branch does:

./perl -le'"foobarfoo"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&"' #right
foo-foo-=foobarfoo

SOLUTION 1
I have added the idea of "temporary offset data" which is "committed" when a WHILEM or END executes.. And I have taught CURLY style regops to store the number of parens before them, and the number which are contained by them so that the CURLY/WHILEM can reset them appropaitely each loop.

For 1 this means that struct regexp_paren_pair gets two new members, start_new and end_new. Like this:

typedef struct regexp_paren_pair {
    SSize_t start;
    SSize_t end;

    SSize_t start_new;
    SSize_t end_new;

    /* 'start_tmp' records a new opening position before the matching end
     * has been found, so that the old start and end values are still
     * valid, e.g.
     *    "abc" =~ /(.(?{print "[$1]"}))+/
     *outputs [][a][b]
     * This field is not part of the API.  */
    SSize_t start_tmp;
} regexp_paren_pair;

The logic that reads offset data has been changed to know about the start_new,end_new variants, and to check them first. When they are -1 the code then looks at the start,end pair. Thus the _new buffers are "transparent", when not set you can see the old value. They are transactional because the OPEN/CLOSE logic writes into the _new buffers, and the value of the _new buffer is only "committed" to the standard buffer at certain times (WHILEM, END and GOSUB).

Each loop of a quantifier is treated as a transaction. Either it succeeds in which case the state is committed, or it fails in which case it is cleared. The overall pattern is treated as a loop as with 1 iteration, so END is also a commit.

So for instance in "foobarfoo"=~/((foo)|(bar))+/ we first match foo, so offs[2].start_new = 0, offs[2].end_new=3, we hit the end of the + loop, and commit the data for $2 and $3, so now offs[2].start = 0, and offs[2].end = 3 as well, offs[3] stays -1 in both data. We then clear offs[2].start_new and offs[2].end_new by setting them to -1. We execute the loop again, and this time we match 'bar', so, we end up committing $2=(-1,-1) $3=(3,7);, We execute the loop again, and this time we mach 'foo' again, and so we end up committing $2=(7,10), $3=(-1,-1). We execute it a third time, which fails to match either, and we are done.

With this CURLYX and CURLYM and CURLYN behave the same.

BUG 2
The other problem I am trying to solve is capture buffers from multiple branches of a pattern being set after an alternation. The capture buffers defined in different branches of an alternation should be mutexed against each other (leaving aside branch reset patterns). So this is wrong:

perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&"'; # wrong
a|a-b-c|a-b|a=abcaba

This should result in either $2 $3 and $4 being set, or $5 and $6, or $7. And $1 should either equal "$2$3$4" or "$5$6" or $7. Like this:

./perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&"'; # right
a|--|-|a=abcaba

SOLUTION 2
I have added the idea that a BRANCH node should clear the buffers it contains when the branch fails. So I have taught the BRANCH logic to store the number of parens before them, and then number which are contained by them as well.

For problem 2, we do something similar. Each BRANCH knows how many capture buffers are contained in its branch, and on failure clears those capture buffers (again by using CAPTURE_CLEAR() and clearing the _new state before they execute their payload. so in /(a)(b)(c)|(a)(b)|(a)/ we end up with: BRANCH(0,3) OPEN1 ... OPEN2 ... OPEN3 BRANCH(3,5) OPEN4 ... OPEN5 ... BRANCH(5,6) OPEN6 ... END

So the first branch when it fails clears buffer 1..3, the second when it fails clears 4..5, and when the third branch fails it clears 6.

The end result is that the state of capture buffers in branches is mutexed.

WHAT I NEED FROM YOU
Everything else in this PR is a minor fix and I don't need your help with. What I really need is your insight into how to exploit the state machine to do the above stuff efficiently, and not have to loop over all the capture buffers for instance to clear or commit them. Or any other suggestions you might have to resolve the above bugs. I am satisfied my PR has the correct semantics and consistent behavior, I just need advice on how to make it efficient. I think maybe I can get rid of a bunch of this patch, maybe even most of it by using the information I have added to the regops in terms of buffers before and contained, to update maxopenparen properly. But I am not entirely sure if i can, GOSUB and other non-linear pattern constructs seem to suggest that wont work. Consider "abac" /((a)|(b))+(c)/, you cant just set maxopenparen to 2 after the last iteration of the loop, as when (c) matches it will get set back to 4, and if the buffer for $3 has not been cleared the value from the previous iteration of the quantifier loop will be visible again.

@demerphq
Copy link
Collaborator Author

I have updated the PR by breaking up the main changes in the PR. The two patches I would really like some feed back from @iabyn are ff43150
and 9942944

@demerphq
Copy link
Collaborator Author

@PhilipHazel could you please check this over? This is an attempt to make Perls capture buffers behave more consistently. It hopefully brings PCRE and Perl into agreement in more cases.

CURLYX doesn't reset capture buffers properly. It is possible
for multiple buffers to be defined at once with values from
different iterations of the loop, which doesn't make sense really.

An example is this:

  "foobarfoo"=~/((foo)|(bar))+/

after this matches $1 should equal $2 and $3 should be undefined,
or $1 should equal $3 and $2 should be undefined. Prior to this
patch this would not be the case.

The solution that this patches uses is to introduce a form of
"layered transactional storage" for paren data. The existing
pair of start/end data for capture data is extended with a
start_new/end_new pair. When the vast majority of our code wants
to check if a given capture buffer is defined they first check
"start_new/end_new", if either is -1 then they fall back to
whatever is in start/end.

When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in
/A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new
data by writing it into start/end. When we begin a new iteration of
the loop we clear the start_new/end_new pairs that are contained by
the loop, by setting them to -1. If the loop fails then we roll back
as we used to. If the loop succeeds we continue. When we hit an END
block we commit everything.

Consider the example above. We start off with everything set to -1.

 $1 = (-1,-1):(-1,-1)
 $2 = (-1,-1):(-1,-1)
 $3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 0, 3)
 $3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data in
the beginning of the next loop:

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):(-1,-1)

We then match "bar":

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):( 3, 7)

and then commit the result and clear the new data:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):(-1,-1)
 $3 = ( 3, 7):(-1,-1)

and then we match "foo" again:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 7,10)
 $3 = ( 3, 7):(-1,-1)

And we then commit. We do a regcppush here as normal.

 $1 = (-1,-1):( 0, 3)
 $2 = ( 7,10):( 7,10)
 $3 = (-1,-1):(-1,-1)

We then clear it again, but since we don't match when we regcppop
we store the buffers back to the above layout. When we finally
hit the END buffer we also do a commit as well on all buffers, including
the 0th (for the full match).

Fixes GH Issue #18865, and adds tests for it and other things.
@florian-pe
Copy link

florian-pe commented May 8, 2024

@demerphq

WHAT I NEED FROM YOU
Everything else in this PR is a minor fix and I don't need your help with. What I really need is your insight into how to exploit the state machine to do the above stuff efficiently, and not have to loop over all the capture buffers for instance to clear or commit them. Or any other suggestions you might have to resolve the above bugs. I am satisfied my PR has the correct semantics and consistent behavior, I just need advice on how to make it efficient.

There is a way to reset the capturing buffers in O(1) upon backtracking and avoid calling memset(). Instead of writing the start and end positions of a captured substring in a random access manner at their "final" index slot in the capture array, it would simpler to use the capture array using a stack discipline, sort of.

Besides the capture array base and capacity fields, it would require a top field and a current field. Both of them would need to be saved in a newly pushed choice point/backtracking frame.

OPEN could simply push a new capture struct on the catpure stack and store the start position in it as if every alternative was inside a branch reset group (?| () | () ). Additionally, OPEN would also write the real capture number in the newly pushed capture struct. The real capture number would be stored in the OPEN regop struct.

Upon backtracking, it would suffice to reset top and current to their previous values. Then new capturing groups could safely override previous ones.

At the end of a successful match, the array @{^CAPTURE} would be constructed with each capture group stored at their correct array indexes. Similarly with $1, $2, etc..

The description above is incomplete and awkward to fully put into words so here is the complete semantic description:

# working example
1 2 2 3 3   4 4 1  # capture group real indexes
( ( ) ( ) | ( ) )
0 1 1 2 2   1 1 0  # capture array indexes

# order in which capture structs are written (considering no backtracking occurs)
( 0 1 2
) 1 2 0

REGEX TRY INIT {
    top = -1;
    current = -1;
}

// The variable pos refers to the current position in the string being matched.

OPEN(n) {
    top++;
    capture[top].start = pos;
    capture[top].end = -1; // in order for (*ACCEPT) to know that the capturing group has not been closed yet
    capture[top].num = n;
    capture[top].prev_current = current;
    current = top;
}

CLOSE {
    capture[current].end = pos;
    current = capture[current].prev_current;
}

BACKTRACK {
    top = last_choice_point->top;
    current = last_choice_point->current;
}

ACCEPT { // (*ACCEPT)
    for (int i=0; i <= top; i++) {
        if (capture[i].end == -1) {
            capture[i].end = pos;
        }
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants