Various regex engine minor changes for @iabyn #20940

Merged 8 commits on Mar 18, 2023
2 changes: 1 addition & 1 deletion dist/Devel-PPPort/parts/embed.fnc
@@ -523,7 +523,7 @@
:
: U autodoc.pl will not output a usage example
:
: W Add a _pDEPTH argument to function prototypes, and an _aDEPTH
: W Add a comma_pDEPTH argument to function prototypes, and a comma_aDEPTH
: argument to the function calls. This means that under DEBUGGING
: a depth argument is added to the functions, which is used for
: example by the regex engine for debugging and trace output.
9 changes: 8 additions & 1 deletion embed.fnc
@@ -557,7 +557,7 @@
:
: 'U' autodoc.pl will not output a usage example
:
: 'W' Add a _pDEPTH argument to function prototypes, and an _aDEPTH argument
: 'W' Add a comma_pDEPTH argument to function prototypes, and a comma_aDEPTH argument
: to the function calls. This means that under DEBUGGING a depth
: argument is added to the functions, which is used for example by the
: regex engine for debugging and trace output. A non DEBUGGING build
@@ -5373,6 +5373,10 @@ ERS |WB_enum|backup_one_WB |NN WB_enum *previous \
|NN const U8 * const strbeg \
|NN U8 **curpos \
|const bool utf8_target
EWi |void |capture_clear |NN regexp *rex \
|U16 from_ix \
|U16 to_ix \
|NN const char *str
ERS |char * |find_byclass |NN regexp *prog \
|NN const regnode *c \
|NN char *s \
@@ -5463,6 +5467,9 @@ ERS |bool |regtry |NN regmatch_info *reginfo \
|NN char **startposp
ES |bool |to_byte_substr |NN regexp *prog
ES |void |to_utf8_substr |NN regexp *prog
EWi |void |unwind_paren |NN regexp *rex \
|U32 lp \
|U32 lcp
# if defined(DEBUGGING)
ES |void |debug_start_match \
|NN const REGEXP *prog \
10 changes: 6 additions & 4 deletions embed.h
@@ -1933,6 +1933,7 @@
# define backup_one_LB(a,b,c) S_backup_one_LB(aTHX_ a,b,c)
# define backup_one_SB(a,b,c) S_backup_one_SB(aTHX_ a,b,c)
# define backup_one_WB(a,b,c,d) S_backup_one_WB(aTHX_ a,b,c,d)
# define capture_clear(a,b,c,d) S_capture_clear(aTHX_ a,b,c,d comma_aDEPTH)
# define find_byclass(a,b,c,d,e) S_find_byclass(aTHX_ a,b,c,d,e)
# define find_next_masked S_find_next_masked
# define find_span_end S_find_span_end
@@ -1945,18 +1946,19 @@
# define isSB(a,b,c,d,e,f) S_isSB(aTHX_ a,b,c,d,e,f)
# define isWB(a,b,c,d,e,f,g) S_isWB(aTHX_ a,b,c,d,e,f,g)
# define reg_check_named_buff_matched S_reg_check_named_buff_matched
# define regcp_restore(a,b,c) S_regcp_restore(aTHX_ a,b,c _aDEPTH)
# define regcppop(a,b) S_regcppop(aTHX_ a,b _aDEPTH)
# define regcppush(a,b,c) S_regcppush(aTHX_ a,b,c _aDEPTH)
# define regcp_restore(a,b,c) S_regcp_restore(aTHX_ a,b,c comma_aDEPTH)
# define regcppop(a,b) S_regcppop(aTHX_ a,b comma_aDEPTH)
# define regcppush(a,b,c) S_regcppush(aTHX_ a,b,c comma_aDEPTH)
# define reghop3 S_reghop3
# define reghop4 S_reghop4
# define reghopmaybe3 S_reghopmaybe3
# define reginclass(a,b,c,d,e) S_reginclass(aTHX_ a,b,c,d,e)
# define regmatch(a,b,c) S_regmatch(aTHX_ a,b,c)
# define regrepeat(a,b,c,d,e,f) S_regrepeat(aTHX_ a,b,c,d,e,f _aDEPTH)
# define regrepeat(a,b,c,d,e,f) S_regrepeat(aTHX_ a,b,c,d,e,f comma_aDEPTH)
# define regtry(a,b) S_regtry(aTHX_ a,b)
# define to_byte_substr(a) S_to_byte_substr(aTHX_ a)
# define to_utf8_substr(a) S_to_utf8_substr(aTHX_ a)
# define unwind_paren(a,b,c) S_unwind_paren(aTHX_ a,b,c comma_aDEPTH)
# if defined(DEBUGGING)
# define debug_start_match(a,b,c,d,e) S_debug_start_match(aTHX_ a,b,c,d,e)
# define dump_exec_pos(a,b,c,d,e,f,g) S_dump_exec_pos(aTHX_ a,b,c,d,e,f,g)
26 changes: 19 additions & 7 deletions perl.h
@@ -30,28 +30,40 @@

/*
=for apidoc_section $debugging
=for apidoc CmnW ||_aDEPTH
=for apidoc CmnW ||comma_aDEPTH
Some functions when compiled under DEBUGGING take an extra final argument named
C<depth>, indicating the C stack depth. This argument is omitted otherwise.
This macro expands to either S<C<, depth>> under DEBUGGING, or to nothing at
all when not under DEBUGGING, reducing the number of C<#ifdef>'s in the code.

The program is responsible for maintaining the correct value for C<depth>.

=for apidoc CyW ||_pDEPTH
This is used in the prototype declarations for functions that take a L</C<_aDEPTH>>
=for apidoc CyW ||comma_pDEPTH
This is used in the prototype declarations for functions that take a L</C<comma_aDEPTH>>
final parameter, much like L<C<pTHX_>|perlguts/Background and MULTIPLICITY>
is used in functions that take a thread context initial parameter.

=for apidoc CmnW ||debug_aDEPTH
Same as L</C<comma_aDEPTH>> but with no leading argument. Intended for functions with
no normal arguments, and used by L</C<comma_aDEPTH>> itself.

=for apidoc CmnW ||debug_pDEPTH
Same as L</C<comma_pDEPTH>> but with no leading argument. Intended for functions with
no normal arguments, and used by L</C<comma_pDEPTH>> itself.

=cut
*/

#ifdef DEBUGGING
# define _pDEPTH ,U32 depth
# define _aDEPTH ,depth
# define debug_pDEPTH U32 depth
# define comma_pDEPTH ,debug_pDEPTH
# define debug_aDEPTH depth
# define comma_aDEPTH ,debug_aDEPTH
#else
# define _pDEPTH
# define _aDEPTH
# define debug_aDEPTH
# define comma_aDEPTH
# define debug_pDEPTH
# define comma_pDEPTH
#endif

/* NOTE 1: that with gcc -std=c89 the __STDC_VERSION__ is *not* defined
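To make the renamed macros concrete, here is a minimal, self-contained sketch of the pattern; it is not code from this PR. The macro definitions are copied from the perl.h hunk above. `U32` is a stand-in typedef, and `S_double_it`, `S_mark`, `caller`, `last_depth_seen`, and `depth_macro_demo` are hypothetical names used only for illustration. `DEBUGGING` is forced on so that the extra depth argument is visible.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t U32;          /* stand-in for perl's own U32 typedef */

#define DEBUGGING 1            /* remove this line to get the non-debug expansion */

#ifdef DEBUGGING
#  define debug_pDEPTH U32 depth
#  define comma_pDEPTH ,debug_pDEPTH
#  define debug_aDEPTH depth
#  define comma_aDEPTH ,debug_aDEPTH
#else
#  define debug_aDEPTH
#  define comma_aDEPTH
#  define debug_pDEPTH
#  define comma_pDEPTH
#endif

#ifdef DEBUGGING
static U32 last_depth_seen;    /* hypothetical: records the depth a callee received */
#endif

/* Hypothetical function with normal arguments: depth is appended after
 * them, so the comma_ variants supply the separating comma. */
static int S_double_it(int x comma_pDEPTH)
{
#ifdef DEBUGGING
    last_depth_seen = depth;
#endif
    return 2 * x;
}
#define double_it(a) S_double_it(a comma_aDEPTH)

/* Hypothetical function with no normal arguments: there is nothing to
 * separate depth from, so the debug_ variants are used instead. */
static void S_mark(debug_pDEPTH)
{
#ifdef DEBUGGING
    last_depth_seen = depth;
#endif
}
#define mark() S_mark(debug_aDEPTH)

/* Under DEBUGGING the caller must have a variable named depth in scope. */
static int caller(int x)
{
#ifdef DEBUGGING
    U32 depth = 3;
#endif
    mark();
    return double_it(x);
}

/* Small self-check: the return value is the same in both build modes. */
static void depth_macro_demo(void)
{
    assert(caller(21) == 42);
}
```

The wrapper macros `double_it` and `mark` play the role of the generated entries in embed.h: call sites stay identical in both build modes, and only the expansion changes.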
5 changes: 5 additions & 0 deletions pod/perldiag.pod
@@ -6575,6 +6575,11 @@ If you know that you have good reason to exceed the limit you can change
it by setting C<${^MAX_NESTED_EVAL_BEGIN_BLOCKS}> to a different value from
the default of 1000.

=item Too many capture groups (limit is %d) in regex m/%s/

(F) You have too many capture groups in your regex pattern. You need to rework
your pattern to use less capture groups.

=item Too many )'s

(A) You've accidentally run your script through B<csh> instead of Perl.
5 changes: 3 additions & 2 deletions pp_ctl.c
@@ -379,11 +379,12 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
PERL_ARGS_ASSERT_RXRES_SAVE;
PERL_UNUSED_CONTEXT;

/* deal with regexp_paren_pair items */
if (!p || p[1] < RX_NPARENS(rx)) {
#ifdef PERL_ANY_COW
i = 7 + (RX_NPARENS(rx)+1) * 4;
i = 7 + (RX_NPARENS(rx)+1) * 2;
#else
i = 6 + (RX_NPARENS(rx)+1) * 4;
i = 6 + (RX_NPARENS(rx)+1) * 2;
#endif
if (!p)
Newx(p, i, UV);
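The size computation in this hunk can be read as a fixed header plus a per-group cost. The sketch below is an illustration under stated assumptions, not the PR's code: `rxres_slots` and `rxres_slots_demo` are hypothetical names. The reading that each `regexp_paren_pair` contributes two saved values (a start and an end offset), with the `+1` covering group 0 (the whole match), is an interpretation of the hunk rather than something it states.

```c
#include <assert.h>
#include <stddef.h>

/* Number of UV slots to allocate for a saved match state.
 * any_cow selects the 7-slot header used when PERL_ANY_COW is defined;
 * otherwise the header is 6 slots. Assumed: two slots per capture group,
 * plus two more for group 0. */
static size_t rxres_slots(size_t nparens, int any_cow)
{
    size_t header = any_cow ? 7 : 6;
    return header + (nparens + 1) * 2;
}

/* Small self-check of the arithmetic. */
static void rxres_slots_demo(void)
{
    assert(rxres_slots(0, 0) == 8);    /* 6 + (0 + 1) * 2 */
    assert(rxres_slots(3, 1) == 15);   /* 7 + (3 + 1) * 2 */
}
```

With the multiplier at 2 rather than 4, a pattern with three capture groups needs 15 slots instead of 23 in the `PERL_ANY_COW` case.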
18 changes: 14 additions & 4 deletions proto.h

Generated file; diff not rendered.

3 changes: 3 additions & 0 deletions regcomp.c
@@ -4000,6 +4000,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp, U32 depth)
capturing_parens:
parno = RExC_npar;
RExC_npar++;
if (RExC_npar >= U16_MAX)
FAIL2("Too many capture groups (limit is %" UVuf ")", (UV)RExC_npar);

logical_parno = RExC_logical_npar;
RExC_logical_npar++;
if (! ALL_PARENS_COUNTED) {
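The new check follows a read, increment, then test order. A minimal sketch of that order, assuming a plain counter in place of `RExC_npar` and an error return in place of `FAIL2()`; `next_capture_group` and `capture_limit_demo` are hypothetical names. The limit presumably relates to helpers such as `capture_clear`, which take `U16` group indices in embed.fnc, but the hunk itself does not say so.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand out the next capture group number, mirroring the hunk:
 * read the counter, bump it, then compare against the 16-bit maximum.
 * Returns 0 on success and -1 once the limit is reached. */
static int next_capture_group(uint32_t *npar, uint32_t *parno)
{
    assert(npar != NULL && parno != NULL);
    *parno = *npar;
    (*npar)++;
    if (*npar >= UINT16_MAX)
        return -1;          /* the real code calls FAIL2() here */
    return 0;
}

/* Small self-check: ordinary patterns are nowhere near the limit. */
static void capture_limit_demo(void)
{
    uint32_t npar = 1, parno = 0;
    assert(next_capture_group(&npar, &parno) == 0);
    assert(parno == 1 && npar == 2);
}
```

Because the comparison happens after the increment, the call that raises the counter to `UINT16_MAX` is the one that fails.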
54 changes: 52 additions & 2 deletions regcomp.h
@@ -13,12 +13,62 @@

#define PERL_REGCOMP_H_

/* define this to 1 if you want to enable a really aggressive and inefficient
* paren cleanup during backtracking. We should pass test with this as 0. */
#ifndef RE_PESSIMISTIC_PARENS
/* Define this to 1 if you want to enable a really aggressive and
* inefficient paren cleanup during backtracking which should ensure
* correctness. Doing so should fix any bugs related to backreferences,
* at the cost of saving and restoring paren state far more than we
* necessarily must.
*
* When it is set to 0 we try to optimize away unnecessary save/restore
* operations which could potentially introduce bugs. We should pass our
* test suite with this as 0, but setting it to 1 might fix cases we do
* not currently test for. If setting this to 1 does fix a bug, then
* review the code related to storing and restoring paren state.
*
* See comment for VOLATILE_REF below for more details of a
* related case.
*/
#define RE_PESSIMISTIC_PARENS 0
#endif

/* a VOLATILE_REF is a ref which is inside of a capturing group and it
* refers to the capturing group it is inside of or to a following capture
* group which might be affected by what this capture group matches, and
* thus the ref requires additional backtracking support. For example:
*
* "xa=xaaa" =~ /^(xa|=?\1a){2}\z/
*
* should not match. In older perls the matching process would go like this:
*
* Iter 1: "xa" matches in capture group.
* Iter 2: "xa" does not match, goes to next alternation.
* "=" matches in =?
* Bifurcates here (= might not match)
* "xa" matches via \1 from previous iteration
* "a" matches via "a" at end of second alternation
* # at this point $1 is "=xaa"
* \z does not match -> backtracks.
* Backtracks to Iter 2 "=?" Bifurcation point where we have NOT matched "="
* "=xaa" matches via \1 (as $1 has not been reset)
* "a" matches via "a" at end of second alternation
* "\z" does match. -> Pattern matches overall.
*
* What should happen and now does happen instead is:
*
* Backtracks to Iter 2 "=?" Bifurcation point where we have NOT matched "=",
* \1 does not match as it is "xa" (as $1 was reset when backtracked)
* and the current character in the string is an "="
*
* The fact that \1 in this case is marked as a VOLATILE_REF is what ensures
* that we reset the capture buffer properly.
*
* See 59db194299c94c6707095797c3df0e2f67ff82b2
* and 38508ce8fc3a1bd12a3bb65e9d4ceb9b396a18db
* for more details.
*/
#define VOLATILE_REF 1

#include "regcharclass.h"

/* Convert branch sequences to more efficient trie ops? */
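The walkthrough in the `VOLATILE_REF` comment can be reproduced with a toy backtracking matcher hard-wired to the pattern `^(xa|=?\1a){2}\z`. This is a sketch for illustration only and shares no code with the regex engine. All names (`Capture`, `cap1`, `reset_on_backtrack`, `match_backref`, `match_iter`, `toy_match`, `volatile_ref_demo`) are hypothetical. The flag toggles only the one step the comment describes: whether `$1` is restored before the `=?` bifurcation is retried.

```c
#include <assert.h>
#include <string.h>

typedef struct { int start, end; } Capture;   /* start < 0 means unset */

static Capture cap1;            /* toy stand-in for $1, shared by the recursion */
static int reset_on_backtrack;  /* 1 = behaviour after the fix, 0 = older behaviour */

/* Try to match \1 at pos; return the position after it, or -1. */
static int match_backref(const char *s, int len, int pos)
{
    int n;
    if (cap1.start < 0)
        return -1;
    n = cap1.end - cap1.start;
    if (pos + n > len || memcmp(s + cap1.start, s + pos, (size_t)n) != 0)
        return -1;
    return pos + n;
}

/* Match the remaining iterations of (xa|=?\1a){2}, then \z. */
static int match_iter(const char *s, int len, int pos, int iter)
{
    Capture saved = cap1;
    int take_eq;

    if (iter == 2)
        return pos == len;                          /* \z */

    /* First alternative: the literal "xa". */
    if (pos + 2 <= len && s[pos] == 'x' && s[pos + 1] == 'a') {
        cap1.start = pos; cap1.end = pos + 2;
        if (match_iter(s, len, pos + 2, iter + 1))
            return 1;
        cap1 = saved;
    }

    /* Second alternative: =?\1a, first with "=" consumed, then without. */
    for (take_eq = 1; take_eq >= 0; take_eq--) {
        int p = pos, q;
        if (take_eq) {
            if (p >= len || s[p] != '=')
                continue;
            p++;
        }
        q = match_backref(s, len, p);
        if (q < 0 || q >= len || s[q] != 'a')
            continue;
        cap1.start = pos; cap1.end = q + 1;
        if (match_iter(s, len, q + 1, iter + 1))
            return 1;
        if (reset_on_backtrack)
            cap1 = saved;   /* undo the capture before retrying without "=" */
    }
    cap1 = saved;           /* all alternatives exhausted */
    return 0;
}

static int toy_match(const char *s, int reset)
{
    reset_on_backtrack = reset;
    cap1.start = cap1.end = -1;
    return match_iter(s, (int)strlen(s), 0, 0);
}

/* Small self-check using the string from the comment. */
static void volatile_ref_demo(void)
{
    assert(toy_match("xa=xaaa", 0) == 1);   /* stale $1 causes a false match */
    assert(toy_match("xa=xaaa", 1) == 0);   /* $1 restored, no match */
}
```

With the flag off, the second attempt at `\1` sees the stale capture `=xaa` and the toy matcher wrongly succeeds; with the flag on, `\1` is `xa` again and fails against `=`, as the comment says it should.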
2 changes: 0 additions & 2 deletions regcomp_internal.h
@@ -1258,6 +1258,4 @@ static const scan_data_t zero_scan_data = {
#define REGNODE_STEP_OVER(ret,t1,t2) \
NEXT_OFF(REGNODE_p(ret)) = ((sizeof(t1)+sizeof(t2))/sizeof(regnode))

#define VOLATILE_REF 1

#endif /* REGCOMP_INTERNAL_H */
4 changes: 2 additions & 2 deletions regen/embed.pl
@@ -252,7 +252,7 @@ sub generate_proto_h {
else {
$ret .= "void" if !$has_context;
}
$ret .= " _pDEPTH" if $has_depth;
$ret .= " comma_pDEPTH" if $has_depth;
$ret .= ")";
my @attrs;
if ( $flags =~ /r/ ) {
@@ -487,7 +487,7 @@ sub embed_h {
$ret .= $replacelist;
if ($flags =~ /W/) {
if ($replacelist) {
$ret .= " _aDEPTH";
$ret .= " comma_aDEPTH";
} else {
die "Can't use W without other args (currently)";
}