WIP: split out matched offset data from regexp structure; create SVt_RXMO for it. #20747


Draft · demerphq wants to merge 31 commits into blead from yves/split_matched_state_regexp_struct

Conversation

demerphq
Collaborator

This is a work in progress. It is on top of the patches in yves/curlyx_curlym. The idea is to speed up matching by detaching the match results from the regexp itself, ultimately getting rid of the mother_re concept.

This means we have to "reverse" the relationship between pmops' and regexp and the last successful match data they contain so that the pmop contains a RXMO structure, which points at the regexp it is for.

Very much a work in progress.

demerphq force-pushed the yves/split_matched_state_regexp_struct branch from 8186a6e to 8c37160 on January 29, 2023 15:01
dump.c Outdated
@@ -70,11 +70,23 @@ static const char* const svshorttypenames[SVt_LAST] = {
"IO"
};

static const char* const unknowntypename = "UNKNOWN";
Contributor

Better as

static const char unknowntypename[] = "UNKNOWN";

Collaborator Author

ack. thanks!

@tonycoz
Contributor

tonycoz commented Jan 30, 2023

You might already be heading there, but I was wondering if this could/would lead to having a user visible match result object?

my $re = qr/.../;
if (my $match = $re->match("somestring")) { # optional start parameter
  # illustrative, not normative
  # perhaps look at existing 
  print $match->text;    # $&
  print $match->text(1); # $1 etc
  print $match->start;   # $-[0]
  # and more
}

Python does something similar: https://docs.python.org/3/library/re.html#match-objects

@demerphq
Collaborator Author

It has definitely occurred to me, yes, but for the short term I am not planning to do this, as it introduces a layer of additional complexity related to SvPVLV which I see in the regexp code and which, frankly, I don't entirely understand and have been trying to avoid needing to understand. :-) As you can see from the failing tests here, this is already a bit hairy.

But yes, I agree that it would be a natural next step to look into once I can get this working. I haven't spent any time at all thinking about what the API would look like, though. I guess a modifier that caused the match to return an object would be possible, as would a function which returned the match results for PL_curpm. If you or anyone else has ideas, maybe we could open an issue to collect them? (I don't think comments on this PR would be the right place.) I guess we could look at Python for precedent there. It also implies we probably need to bless these objects and so on, which I also haven't thought much about. Right now, having them be purely internal objects like SVt_INVLIST saves me having to worry about a lot of possible issues that might come up, but I certainly recognize the potential these new objects have.

@tonycoz
Contributor

tonycoz commented Jan 30, 2023

The RXMO doesn't necessarily need to be blessed, or even visible to the Perl user; it could be done as a wrapper that hides the RXMO.

But as you say, that comes later.

demerphq force-pushed the yves/split_matched_state_regexp_struct branch from 8c37160 to b2da3f6 on January 30, 2023 04:42
@iabyn
Contributor

iabyn commented Jan 30, 2023 via email

CURLYX doesn't reset capture buffers properly. It is possible
for multiple buffers to be defined at once with values from
different iterations of the loop, which doesn't make sense really.

An example is this:

  "foobarfoo"=~/((foo)|(bar))+/

after this matches $1 should equal $2 and $3 should be undefined,
or $1 should equal $3 and $2 should be undefined. Prior to this
patch this would not be the case.
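
For illustration, here is a minimal check of that invariant (a hypothetical sketch, not taken verbatim from the tests the patch adds):

    # After the match, exactly one of $2/$3 should be defined,
    # and $1 should equal whichever of them is defined.
    if ("foobarfoo" =~ /((foo)|(bar))+/) {
        my $inner = defined $2 ? $2 : $3;
        if ((defined $2 xor defined $3) && $1 eq $inner) {
            print "captures consistent\n";
        }
        else {
            print "captures inconsistent\n";
        }
    }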

The solution that this patch uses is to introduce a form of
"layered transactional storage" for paren data. The existing
pair of start/end data for capture data is extended with a
start_new/end_new pair. When the vast majority of our code wants
to check if a given capture buffer is defined, it first checks
start_new/end_new; if either is -1, it falls back to whatever
is in start/end.

When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in
/A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new
data by writing it into start/end. When we begin a new iteration of
the loop we clear the start_new/end_new pairs that are contained by
the loop, by setting them to -1. If the loop fails then we roll back
as we used to. If the loop succeeds we continue. When we hit an END
block we commit everything.
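
As a rough sketch of that layering, here is a toy Perl model (purely illustrative; the real code operates on the C-level paren offset pairs and the names below are made up):

    # Each capture has a committed (start, end) pair and a pending
    # (start_new, end_new) pair; -1 means "unset".
    sub effective {                     # what "is this capture defined?" sees
        my ($c) = @_;
        return ($c->{start_new} == -1 || $c->{end_new} == -1)
            ? ($c->{start},     $c->{end})       # fall back to committed data
            : ($c->{start_new}, $c->{end_new});  # otherwise use pending data
    }
    sub commit { my ($c) = @_; @{$c}{qw(start end)} = @{$c}{qw(start_new end_new)} }
    sub clear  { my ($c) = @_; @{$c}{qw(start_new end_new)} = (-1, -1) }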

Consider the example above. We start off with everything set to -1.

 $1 = (-1,-1):(-1,-1)
 $2 = (-1,-1):(-1,-1)
 $3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 0, 3)
 $3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data in
the beginning of the next loop:

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):(-1,-1)

We then match "bar":

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):( 3, 7)

and then commit the result and clear the new data:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):(-1,-1)
 $3 = ( 3, 7):(-1,-1)

and then we match "foo" again:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 7,10)
 $3 = ( 3, 7):(-1,-1)

And we then commit. We do a regcppush here as normal.

 $1 = (-1,-1):( 0, 3)
 $2 = ( 7,10):( 7,10)
 $3 = (-1,-1):(-1,-1)

We then clear it again, but since we don't match anything further,
the regcppop restores the buffers to the above layout. When we finally
hit the END node we also do a commit on all buffers, including
the 0th (for the full match).

Fixes GH Issue #18865, and adds tests for it and other things.
In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.
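
A hypothetical illustration of that property (not taken from the added tests):

    # Each iteration takes exactly one branch, so $2 (from the (a)(b)
    # branch) and $4 (from the lone (a) branch) should never both be
    # defined once the match finishes.
    if ("aa" =~ /((a)(b)|(a))+/) {
        print defined $2 && defined $4
            ? "both defined (bug)\n"
            : "at most one defined (expected)\n";
    }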

We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.

It is a bit more complex than it should be because we have BRANCHJ
and BRANCH. (One of these days we should merge them together.)

This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers at two levels: the
overall TRIE op, and, especially for jump tries, the individual branches
whose behavior we emulate. So we have to do the same clearing logic if
a trie branch fails as well. Capture buffer semantics should now be
consistent.
Backrefs to unclosed parens inside of a quantified group were not being
properly handled, which revealed that we were not unwinding the paren state
properly on failure and backtracking.

Much of the code assumes that when we execute a "conditional" operation (where
more than one thing could match) we need not concern ourselves with the
paren state unless the conditional operation itself represents a paren, and
that generally opcodes only need to concern themselves with parens to their
right. When you exclude backrefs from the equation this is broadly reasonable
(I think), as on failure we typically don't care about the state of the paren
buffers. They either get reset as we find a different accepting pathway,
or their state is irrelevant if the overall match is rejected (e.g. it fails).

However backreferences are different. Consider the following pattern
from the tests

    "xa=xaaa" =~ /^(xa|=?\1a){2}\z/

In the first iteration through this the first branch matches; in fact,
because the \1 is in the second branch, that branch can't match on the first
iteration at all. After this, $1 = "xa". We then perform the second iteration.
"xa" does not match "=xaaa", so we fall to the second branch. The '=?' matches,
but sets up a backtracking action to not match if the rest of the pattern does
not match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a'
in the string; we exit the quantifier loop with $1 = "=xaa", match \z against
the remaining "a" in the string, and fail.

Here is where things go wrong in the old code: we unwind to the outer loop,
but we do not unwind the paren state. We then unwind further into the second
iteration of the loop, to the '=?', where we try to match the tail with
the quantifier matching the empty string. We then match the old $1 (which was
not unwound) as "=xaa", then the "a" matches, and we are at the end of the
string, so we have incorrectly accepted this string as matching the pattern.

What should have happened is that when the \1 was resolved the second time it
should have returned the same string as it did when the '=?' matched '=', which
would then have resulted in the tail being matched again, and so on, eventually
unwinding the entire pattern when the second iteration failed entirely.
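
At the Perl level, the expected outcome looks like this (illustrative only; the pattern and string are the test case quoted above):

    # Once \1 resolves consistently, the second iteration cannot complete,
    # so the overall match should fail for this string.
    print "xa=xaaa" =~ /^(xa|=?\1a){2}\z/
        ? "matched (the old, incorrect behaviour)\n"
        : "no match (expected)\n";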

This patch is very crude. It simply pushes the state of the parens and creates
an unwind point for every case where we do a transition to a B or _next
operation, and we make the corresponding _next_fail do the appropriate
unwinding. The objective was to achieve correctness and then work towards
making it more efficient. We almost certainly overstore items on the stack.

In a future patch we can perhaps keep track of the unclosed parens before the
relevant operators and make sure that they are properly pushed and unwound at
the correct times.
This way we can do the required paren restoration only when it is in use. When
we match a REF type node which is potentially a reference to an unclosed paren
we push the match context information, currently for "everything", but in a
future patch we can teach it to be more efficient by adding a new parameter to
the REF regop to track which parens it should save.

This converts the backtracking changes from the previous commit so that they
run only when specifically enabled via the define RE_PESSIMISTIC_PARENS, which
is 0 by default. We don't make the new fields in the struct conditional, as the
stack frames are large, our changes don't make any real difference, and it
keeps things simpler to not have conditional members, especially since some of
the structures have to line up with each other.

If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means
something is sensitive to us not necessarily restoring the parens properly on
failure. We make some assumptions that the paren state after a failing state
will be corrected by a future successful state, or that the state of the
parens is irrelevant as we will fail anyway. This can be made not true by
EVAL, backrefs, and potentially some other scenarios. Thus I have left this
inefficient logic in place but guarded by the flag.
This eliminates the regnode_2L data structure, and merges it with the older
regnode_2 data structure. At the same time it makes each "arg" property of the
various regnode types that have one be consistently structured as an anonymous
union like this:

    union {
        U32 arg1u;
        I32 arg1i;
        struct {
            U16 arg1a;
            U16 arg1b;
        };
    };

We then expose four macros for accessing each slot: ARG1u(), ARG1i(),
ARG1a() and ARG1b() for the first slot, and so on. Code then explicitly
designates which view it wants. The old logic used ARG() to access a U32 arg1,
and ARG1() to access an I32 arg1, which was confusing to say the least. The
regnode_2L structure had a U32 arg1 and an I32 arg2, and the regnode_2 data
structure had two I32 args. With the new set of macros we use regnode_2 for
both, and use the appropriate macros to show whether we want signed or
unsigned values.

This also renames regnode_4 to regnode_3. The 3 stands for "three 32-bit
args". However, as each slot can also store two U16s, a regnode_3 can hold up
to six U16s, or three I32s, or a combination. For instance, the CURLY style
nodes use regnode_3 to store four values: ARG1i() for the min count, ARG2i()
for the max count, and ARG3a() and ARG3b() for the parens before and inside
the quantifier.

It also renames the function reganode() to reg1node(), and reg2Lanode()
to reg2node(). The 2L thing was just confusing.
This way we can avoid pushing every buffer; we only need to push
the nestroot of the ref.
I left a bit of debugging and commented-out code in the PR. This
removes or reworks that code so it does not run in production mode.
This insulates access to the regexp match offset data so we can
fix the define later and move the offset structure into a new struct.

The RXp_OFFSp() macro was introduced in a recent commit to deliberately
break anything using RXp_OFFS() directly. It is deliberately hard to type;
nothing but the internals should use it. Everything else should use one
of the wrappers around it.
Obviously this isn't required as we build fine. But doing this
future proofs us to other changes.
This field will be moving to a new struct. Converting this to a macro
will make that move easier.
We were missing various RXp_XXXX() and RX_XXXX() macros. This adds
them so we can use them in places where we are unreasonably intimate
with the regexp struct internals.
We will move some of these members out of the regexp structure
into a new sub-structure. This isolates those changes to the
macro definitions.
We will move this struct member into a new struct in a future patch,
and using the macros means we can reduce the number of places that
need to be explicitly aware of the new structure.
We will move this member to a new struct in the near future,
converting all uses to a macro isolates that change.
This member of the regexp structure will be moved to a new
structure in the near future. Converting to use the macro
will make this change easier to manage.
We will migrate this struct member to a new struct in the near future;
this change will make that patch more minimal and hide the gory details.
This member of the regexp struct will soon be migrated to a new
independent structure. This change ensures that when we do the migration
the changes are restricted to as little code as possible.
We will migrate this member to a new structure in the near future,
wrapping with a macro makes that migration simpler and less invasive.
We will move various members of the regexp structure to a new
structure which just contains information about the match. Wrapping
the members in the standard macros means that change can be made
less invasive. We already did all of this in regexec.c
This member is moving out of the regexp structure and into a
new structure in the very near future. Using the macro to access
it minimizes the size of that change.
rex->maxlen holds the maximum length the pattern can match, not the
minimum. It was obviously copied from the rex->minlen case,
so fix it to be correct.
We have the data in dump.c but we don't seem to use it in very
many places.
newSV_type() claimed it was actually sv_upgrade, which it isn't.

Also the error message is less than helpful when someone is adding
a new type. So show a better error message and distinguish between
"I don't know how to handle this new type you have added" and
"That type id is simply not valid" in both cases, with the correct
C sub names in the error message.