fixup many regexp bugs - not to be merged yet #20677
base: blead
Conversation
This now works as I believe is expected, and matches PCRE etc. I am pretty sure it is nothing as elegant and efficient as it could be using the regexp state machine, but I do believe it is now correct. You have to start somewhere. :-)

The important gist of the changes:

- The offset data has been doubled to hold "start/end" and "start_new/end_new" pairs. OPEN/CLOSE etc. update start_new/end_new.
- All of the buffer read logic has been changed to check both, first the "_new" variants, then the standard ones.
- At the end of a successful iteration of the quantifier we "commit" the state of the _new data to the standard data. Prior to a new iteration we clear the _new data.
- All CURLY-style operators now store the min/max as I32s, and they store the "first paren" and "last paren" of their contents.
- All branches now store the number of parens before them, and contained by them. As do tries. After a failed branch we clear the _new data for the parens it contains (as well as whatever else we did historically).

There are a few other minor changes embedded into these patches which I will tease out, but that is the main point. @iabyn if you have suggestions on how to make it more efficient I would love to hear them.
@jkeenan why did you add a "do not merge" to this /draft/ PR? Doesn't that kinda defeat the point of it being draft?
On 1/7/23 17:04, Yves Orton wrote:
@jkeenan <https://github.com/jkeenan> why did you add a "do not merge"
to this /draft/ PR? Doesn't that kinda defeat the point of it being draft?
I don't think you need "do not merge" in the Subject. But do whatever
you want.
@kid51 the point is that it is a note to me. Before I change it from draft I'll probably split some of those patches apart, and I will certainly remove a bunch of debug crap. I won't change it from draft until I have done so. I thought do-not-merge was for ready PRs that we are sitting on till a later release cycle. Anyway, it doesn't matter, it just struck me as redundant.
I've had a look over it, but I'm not especially familiar with the regexp engine.
Your feedback is already super useful, thanks. I need to think on what to do about
On Sat, Jan 07, 2023 at 10:01:15AM -0800, Yves Orton wrote:
The offset data has been doubled to hold a "start/end" and
"start_new/end_new" pairs. OPEN/CLOSE etc update start_new/end_new. All
of the buffer read logic has been changed to check both, first the
"_new" variants, then the standard ones. At the end of a successful
iteration of the quantifier we "commit" the state of the _new data to
the standard data. Prior to a new iteration we clear the _new data. All
CURLY style operators now store the min/max as I32's and they store the
"first paren" and "last paren" of their contents.
I have zero enthusiasm to review this currently. This is because
it's a big patch, with practically no explanation of what fundamental
problems it's trying to fix, and what it's doing to fix them.
--
The optimist believes that he lives in the best of all possible worlds.
As does the pessimist.
@iabyn - fair enough. I thought I had explained well enough in the PR documentation. I didn't bother in the commits because I am fairly certain I am not doing things efficiently, and I would like your advice on how to do them efficiently, so I expect to rip them up after I get some of your advice. What I have written in the text of this PR is more relevant generally than the code. But clearly I have not explained myself sufficiently.

I am trying to solve the issues of clearing capture buffers properly, and setting them properly inside of quantified expressions. Especially I am trying to resolve the issues that come from CURLYX, CURLYN and CURLYM not being equivalent. The code says that converting CURLYX to CURLYM or CURLYN is a "powerful optimization", which is correct in a sense, but it is not purely an optimization: as it turns out, CURLYM and CURLYN do things "right" as far as capture buffers go, and CURLYX does not. That is problematic, as we disable these optimizations under various circumstances, meaning a minor change to the pattern can result in the state of the capture buffers afterwards being different.

Essentially there are two core bugs here. They are both solved using the same underlying machinery: teaching the relevant regops to know how many parens are before them and how many are defined within them, a macro called CAPTURE_CLEAR() and a macro called CAPTURE_COMMIT(), and the addition of "transactional transparent capture buffers" which are committed only when a quantifier iteration is successful.

BUG 1
This should not leave $2 AND $3 set at the same time. Either $3 should be undef and $1 should equal $2, or $2 should be undef and $1 should equal $3. It should work like the branch does:
SOLUTION 1

For bug 1 this means that struct regexp_paren_pair gets two new members, start_new and end_new. Like this:
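The code block that originally followed here did not survive the page capture. As a rough sketch of what the description implies, modeled in Python rather than the engine's C (the field names come from the text above; everything else is illustrative):

```python
from dataclasses import dataclass

# Python stand-in for the extended C struct regexp_paren_pair described
# above. -1 means "not set", matching the convention in the PR text.
@dataclass
class ParenPair:
    start: int = -1      # committed open offset
    end: int = -1        # committed close offset
    start_new: int = -1  # uncommitted open offset from the current iteration
    end_new: int = -1    # uncommitted close offset from the current iteration

# A freshly created pair is entirely unset.
p = ParenPair()
assert (p.start, p.end, p.start_new, p.end_new) == (-1, -1, -1, -1)
```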
The logic that reads offset data has been changed to know about the start_new/end_new variants, and to check them first. When they are -1 the code then looks at the start/end pair. Thus the _new buffers are "transparent": when not set, you can see the old value. They are transactional because the OPEN/CLOSE logic writes into the _new buffers, and the value of a _new buffer is only "committed" to the standard buffer at certain times (WHILEM, END and GOSUB).

Each loop of a quantifier is treated as a transaction. Either it succeeds, in which case the state is committed, or it fails, in which case it is cleared. The overall pattern is treated as a loop with 1 iteration, so END is also a commit. So for instance in

With this, CURLYX, CURLYM and CURLYN behave the same.

BUG 2
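A minimal Python model of this transparent/transactional behavior (the real implementation is C inside the regex engine; apart from the CAPTURE_COMMIT()/CAPTURE_CLEAR() names mentioned in this PR, all names here are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class ParenPair:
    start: int = -1      # committed layer
    end: int = -1
    start_new: int = -1  # transactional layer, written by OPEN/CLOSE
    end_new: int = -1

def read(p: ParenPair):
    # Transparent read: prefer the _new slot; when it is -1, fall
    # through to the committed value.
    return (p.start_new if p.start_new != -1 else p.start,
            p.end_new if p.end_new != -1 else p.end)

def capture_commit(pairs, first, last):
    # Model of CAPTURE_COMMIT(): at WHILEM/END/GOSUB a successful
    # iteration copies the transactional layer into the committed one.
    for p in pairs[first:last + 1]:
        p.start, p.end = p.start_new, p.end_new

def capture_clear(pairs, first, last):
    # Model of CAPTURE_CLEAR(): before a new iteration (or after a
    # failed branch) the uncommitted data is dropped.
    for p in pairs[first:last + 1]:
        p.start_new = p.end_new = -1

pairs = [ParenPair() for _ in range(4)]       # $0..$3
pairs[2].start_new, pairs[2].end_new = 0, 3   # OPEN2/CLOSE2 matched "foo"
assert read(pairs[2]) == (0, 3)               # visible even before commit
capture_commit(pairs, 2, 3)                   # iteration succeeded
capture_clear(pairs, 2, 3)                    # start the next iteration
assert read(pairs[2]) == (0, 3)               # still visible: committed
assert read(pairs[3]) == (-1, -1)             # never set
```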
This should result in either $2, $3 and $4 being set, or $5 and $6, or $7. And $1 should equal either "$2$3$4" or "$5$6" or $7. Like this:
SOLUTION 2

For problem 2 we do something similar. Each BRANCH knows how many capture buffers are contained in its branch, and on failure clears those capture buffers (again by using CAPTURE_CLEAR()), clearing the _new state before the next branch executes its payload. So in /(a)(b)(c)|(a)(b)|(a)/ we end up with:

BRANCH(0,3) OPEN1 ... OPEN2 ... OPEN3
BRANCH(3,5) OPEN4 ... OPEN5 ...
BRANCH(5,6) OPEN6 ...
END

So the first branch when it fails clears buffers 1..3, the second when it fails clears 4..5, and when the third branch fails it clears 6. The end result is that the state of capture buffers in branches is mutexed.

WHAT I NEED FROM YOU
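The branch bookkeeping above can be sketched the same way. A toy Python walk-through of /(a)(b)(c)|(a)(b)|(a)/ matching the string "a" (offsets and helper names are illustrative, not the engine's code):

```python
from dataclasses import dataclass

@dataclass
class ParenPair:
    start: int = -1
    end: int = -1
    start_new: int = -1
    end_new: int = -1

pairs = {n: ParenPair() for n in range(1, 7)}  # $1..$6

def clear(first, last):
    # CAPTURE_CLEAR model: a failed branch drops its uncommitted captures.
    for n in range(first, last + 1):
        pairs[n].start_new = pairs[n].end_new = -1

def commit(first, last):
    # Commit model: END promotes the transactional layer.
    for n in range(first, last + 1):
        p = pairs[n]
        p.start, p.end = p.start_new, p.end_new

def read(n):
    p = pairs[n]
    return (p.start_new if p.start_new != -1 else p.start,
            p.end_new if p.end_new != -1 else p.end)

pairs[1].start_new, pairs[1].end_new = 0, 1  # branch 1: (a) matched...
clear(1, 3)                                  # ...then (b) failed: BRANCH(0,3) clears 1..3
pairs[4].start_new, pairs[4].end_new = 0, 1  # branch 2: (a) matched...
clear(4, 5)                                  # ...then (b) failed: BRANCH(3,5) clears 4..5
pairs[6].start_new, pairs[6].end_new = 0, 1  # branch 3: (a) matched
commit(1, 6)                                 # END commits everything

assert all(read(n) == (-1, -1) for n in (1, 2, 3, 4, 5))
assert read(6) == (0, 1)  # only $6 is set: the branches' buffers are mutexed
```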
@PhilipHazel could you please check this over? This is an attempt to make Perl's capture buffers behave more consistently. It hopefully brings PCRE and Perl into agreement in more cases.
CURLYX doesn't reset capture buffers properly. It is possible for multiple buffers to be defined at once with values from different iterations of the loop, which doesn't really make sense. An example is this:

"foobarfoo" =~ /((foo)|(bar))+/

After this matches, $1 should equal $2 and $3 should be undefined, or $1 should equal $3 and $2 should be undefined. Prior to this patch this would not be the case.

The solution that this patch uses is to introduce a form of "layered transactional storage" for paren data. The existing pair of start/end data for capture data is extended with a start_new/end_new pair. When the vast majority of our code wants to check if a given capture buffer is defined, it first checks start_new/end_new; if either is -1 then it falls back to whatever is in start/end. When a capture buffer is CLOSEd the data is written into the start_new/end_new pair instead of the start/end pair.

When a CURLYX loop is executing and has matched something (at least one "A" in /A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new data by writing it into start/end. When we begin a new iteration of the loop we clear the start_new/end_new pairs that are contained by the loop, by setting them to -1. If the loop fails then we roll back as we used to. If the loop succeeds we continue. When we hit an END block we commit everything.

Consider the example above. We start off with everything set to -1.
$1 = (-1,-1):(-1,-1)
$2 = (-1,-1):(-1,-1)
$3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

$1 = (-1,-1):( 0, 3)
$2 = (-1,-1):( 0, 3)
$3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data at the beginning of the next loop:

$1 = (-1,-1):( 0, 3)
$2 = ( 0, 3):(-1,-1)
$3 = (-1,-1):(-1,-1)

We then match "bar":

$1 = (-1,-1):( 0, 3)
$2 = ( 0, 3):(-1,-1)
$3 = (-1,-1):( 3, 6)

and then commit the result and clear the new data:

$1 = (-1,-1):( 0, 3)
$2 = (-1,-1):(-1,-1)
$3 = ( 3, 6):(-1,-1)

and then we match "foo" again:

$1 = (-1,-1):( 0, 3)
$2 = (-1,-1):( 6, 9)
$3 = ( 3, 6):(-1,-1)

And we then commit. We do a regcppush here as normal.

$1 = (-1,-1):( 0, 3)
$2 = ( 6, 9):( 6, 9)
$3 = (-1,-1):(-1,-1)

We then clear it again, but since we don't match any further, when we regcppop we restore the buffers to the above layout. When we finally hit the END node we also do a commit on all buffers, including the 0th (for the full match).

Fixes GH Issue #18865, and adds tests for it and other things.
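The walk-through above can be replayed mechanically. A Python sketch of the same commit/clear sequence for "foobarfoo" =~ /((foo)|(bar))+/ (helper names invented; the loop body is taken to contain parens 2..3, with $1 committed at END):

```python
from dataclasses import dataclass

@dataclass
class ParenPair:
    start: int = -1
    end: int = -1
    start_new: int = -1
    end_new: int = -1

pairs = {n: ParenPair() for n in (1, 2, 3)}

def read(n):
    # Transparent read: prefer _new, fall back to the committed value.
    p = pairs[n]
    return (p.start_new if p.start_new != -1 else p.start,
            p.end_new if p.end_new != -1 else p.end)

def commit(first, last):
    # CAPTURE_COMMIT model: copy the _new layer into the committed one.
    for n in range(first, last + 1):
        p = pairs[n]
        p.start, p.end = p.start_new, p.end_new

def clear(first, last):
    # CAPTURE_CLEAR model: drop uncommitted data.
    for n in range(first, last + 1):
        pairs[n].start_new = pairs[n].end_new = -1

def match_group(n, start, end):
    # OPEN/CLOSE write only into the _new layer.
    pairs[n].start_new, pairs[n].end_new = start, end

match_group(1, 0, 3); match_group(2, 0, 3)  # iteration 1: "foo" at 0..3
commit(2, 3); clear(2, 3)                   # WHILEM: success, next iteration
match_group(1, 3, 6); match_group(3, 3, 6)  # iteration 2: "bar" at 3..6
commit(2, 3); clear(2, 3)
match_group(1, 6, 9); match_group(2, 6, 9)  # iteration 3: "foo" at 6..9
commit(1, 3)                                # END: commit everything

assert read(1) == (6, 9)    # $1 eq "foo"
assert read(2) == (6, 9)    # $2 eq $1
assert read(3) == (-1, -1)  # $3 is undef
```

Note how committing $3's empty _new layer in iteration 3 is what finally erases the stale "bar" capture.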
There is a way to reset the capturing buffers in O(1) upon backtracking and avoid calling

Besides the capture array base and capacity fields, it would require a

OPEN could simply push a new capture struct on the capture stack and store the start position in it, as if every alternative was inside a branch reset group.

Upon backtracking, it would suffice to reset

At the end of a successful match, the array

The description above is incomplete and awkward to fully put into words, so here is the complete semantic description:
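A hedged sketch of the stack-based idea being described, in Python (the comment above is partially truncated, so the details here, struct layout and method names alike, are guesses at the general technique, not the actual proposal):

```python
# Sketch: captures are pushed onto a stack; a backtrack point only
# remembers the stack depth, so undoing any number of captures is a
# single assignment, i.e. O(1). All names here are invented.
class CaptureStack:
    def __init__(self):
        self.entries = []  # (paren_number, start, end)
        self.depth = 0     # logical top of stack

    def push(self, paren, start, end):
        # Reuse dead slots above the logical top instead of growing.
        if self.depth < len(self.entries):
            self.entries[self.depth] = (paren, start, end)
        else:
            self.entries.append((paren, start, end))
        self.depth += 1

    def mark(self):
        return self.depth

    def rollback(self, mark):
        self.depth = mark  # O(1): everything above is simply dead

    def final_captures(self):
        # At match end, walk the live entries; last write wins per paren.
        out = {}
        for paren, start, end in self.entries[:self.depth]:
            out[paren] = (start, end)
        return out

st = CaptureStack()
st.push(1, 0, 3)   # $1 tentatively set
mark = st.mark()
st.push(2, 0, 3)   # speculative captures inside a branch
st.push(3, 3, 6)
st.rollback(mark)  # branch failed: one assignment undoes both pushes
st.push(2, 3, 6)   # another branch succeeds
assert st.final_captures() == {1: (0, 3), 2: (3, 6)}
```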
This fixes a number of longstanding regexp bugs and issues. It includes debugging changes and other stuff I will rebase away.
The main point is that we use a two-layer model to represent capture buffers, implement a concept of "committing" the changes, and store additional data in the CURLY nodes so we can manage the parens they contain properly. This is likely not the most efficient way to do this; the idea was to be correct first, then optimize.
The intended semantics are that when we have a quantified group we keep track of the parens that it contains, and we ensure that all the capture buffers it contains come from a single iteration of the quantifier.
Consider
In older perls:
You can see that in older perls $2 is from the second iteration, and $3 is from the third.
This patch currently doesn't entirely fix this (yet):
where arguably $2 and $5 should be empty, as they weren't part of the last successful match. Currently I have only fixed CURLYX -> CURLYM, and what should happen here is a bit controversial; you could argue this is correct.
However even for this case we have a change from older perls:
which really highlights the underlying bug: we have capture buffers here from the first, second and third iterations of the +. IMO that doesn't make sense, and it is inconsistent with the behavior of capture buffers in other contexts.
This PR also fixes some other related issues; for instance, the min and max have been changed and REG_INFTY is now I32_MAX, as opposed to U16_MAX. This means that CURLYM and CURLYX now behave the same.
This is intended to fix #20661 and related bugs.
Fixes #20661
Fixes #8267
Fixes #19615
Fixes #18865
Fixes #14848
Fixes #13619
Fixes #10073