Skip to content

Commit d8d1ded

Browse files
committed
Improve handling of nested qr/(?[...])/
A set operations expression can contain a previously-compiled one interpolated in. Prior to this commit, some heuristics were employed to verify it actually was such a thing, and not a sort of look-alike that wasn't necessarily valid. The heuristics actually forbade legal ones. I don't know of any illegal ones that were let through, but it is certainly possible. Also, the error/warning messages referred to the heuristics, and were unhelpful at best. The technique used instead in this commit is to return a regop only used by this feature for any nested compilations. This guarantees that the caller can determine if the result is valid, and what that result is without having to do any heuristics or inspecting any flags. The error/warning messages are changed to reflect this, and I believe are now helpful. This fixes the bugs in #16779 #16779 (comment)
1 parent 9f55257 commit d8d1ded

File tree

8 files changed

+114
-79
lines changed

8 files changed

+114
-79
lines changed

embed.fnc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2583,6 +2583,8 @@ ES |regnode_offset|regnode_guts|NN RExC_state_t *pRExC_state \
25832583
ES |void |change_engine_size|NN RExC_state_t *pRExC_state|const Ptrdiff_t size
25842584
ES |regnode_offset|reganode|NN RExC_state_t *pRExC_state|U8 op \
25852585
|U32 arg
2586+
ES |regnode_offset|regpnode|NN RExC_state_t *pRExC_state|U8 op \
2587+
|NN void * arg
25862588
ES |regnode_offset|reg2Lanode|NN RExC_state_t *pRExC_state \
25872589
|const U8 op \
25882590
|const U32 arg1 \

embed.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1052,6 +1052,7 @@
10521052
#define reginsert(a,b,c,d) S_reginsert(aTHX_ a,b,c,d)
10531053
#define regnode_guts(a,b,c,d) S_regnode_guts(aTHX_ a,b,c,d)
10541054
#define regpiece(a,b,c) S_regpiece(aTHX_ a,b,c)
1055+
#define regpnode(a,b,c) S_regpnode(aTHX_ a,b,c)
10551056
#define regtail(a,b,c,d) S_regtail(aTHX_ a,b,c,d)
10561057
#define scan_commit(a,b,c,d) S_scan_commit(aTHX_ a,b,c,d)
10571058
#define set_ANYOF_arg(a,b,c,d,e) S_set_ANYOF_arg(aTHX_ a,b,c,d,e)

pod/perldelta.pod

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,14 @@ specification. There is clearly no demand for them, given that no one
8080
has ever complained in the many years the functions were claimed to be
8181
available, hence so-called "support" for them is now dropped.
8282

83+
=head2 A bug fix for C<(?[...])> may have caused some patterns to no
84+
longer compile
85+
86+
See L</Selected Bug Fixes>. The heuristics previously used may have let
87+
some constructs compile (perhaps not with the programmer's intended
88+
effect) that should have been errors. None are known, but it is
89+
possible that some erroneous constructs no longer compile.
90+
8391
=head1 Deprecations
8492

8593
XXX Any deprecated features, syntax, modules etc. should be listed here.
@@ -262,6 +270,12 @@ and New Warnings
262270

263271
XXX L<message|perldiag/"message">
264272

273+
L<Expecting interpolated extended charclass in regex; marked by <-- HERE in mE<sol>%sE<sol>
274+
|perldiag/"Expecting interpolated extended charclass in regex; marked by <-- HERE in mE<sol>%sE<sol>">
275+
276+
This is a replacement for several error messages listed under
277+
L</Changes to Existing Diagnostics>.
278+
265279
=back
266280

267281
=head3 New Warnings
@@ -357,6 +371,18 @@ Some instances of this message previously output the hex digits C<A>,
357371
C<B>, C<C>, C<D>, C<E>, and C<F> in lower case. Now they are all
358372
consistently upper case.
359373

374+
=item *
375+
376+
The following three diagnostics have been removed, and replaced by
377+
L<C<Expecting interpolated extended charclass in regex; marked by <-- HERE in mE<sol>%sE<sol>>
378+
|perldiag/"Expecting interpolated extended charclass in regex; marked by <-- HERE in mE<sol>%sE<sol>">.
379+
C<Expecting close paren for nested extended charclass in regex; marked
380+
by <-- HERE in mE<sol>%sE<sol>>,
381+
C<Expecting close paren for wrapper for nested extended charclass in
382+
regex; marked by <-- HERE in mE<sol>%sE<sol>>,
383+
and
384+
C<Expecting '(?flags:(?[...' in regex; marked by S<<-- HERE> in mE<sol>%sE<sol>>.
385+
360386
=back
361387

362388
=head1 Utility Changes
@@ -517,6 +543,14 @@ eg. on C<local %INC = %INC;>. This has been fixed [GH #17428]
517543
C<(?{...})> eval groups in regular expressions no longer unintentionally
518544
trigger "EVAL without pos change exceeded limit in regex" [GH #17490].
519545

546+
=item *
547+
548+
C<(?[...])> extended bracketed character classes do not wrongly raise an
549+
error on some cases where a previously-compiled such class is
550+
interpolated into another. The heuristics previously used have been
551+
replaced by a reliable method, and hence the diagnostics generated have
552+
changed. See L</Diagnostics>.
553+
520554
=back
521555

522556
=head1 Known Problems

pod/perldiag.pod

Lines changed: 9 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -2292,36 +2292,18 @@ to denote a capturing group of the form
22922292
L<C<(?I<PARNO>)>|perlre/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)>,
22932293
but omitted the C<")">.
22942294

2295-
=item Expecting close paren for nested extended charclass in regex; marked
2296-
by <-- HERE in m/%s/
2297-
2298-
(F) While parsing a nested extended character class like:
2299-
2300-
(?[ ... (?flags:(?[ ... ])) ... ])
2301-
^
2302-
2303-
we expected to see a close paren ')' (marked by ^) but did not.
2304-
2305-
=item Expecting close paren for wrapper for nested extended charclass in
2306-
regex; marked by <-- HERE in m/%s/
2307-
2308-
(F) While parsing a nested extended character class like:
2309-
2310-
(?[ ... (?flags:(?[ ... ])) ... ])
2311-
^
2295+
=item Expecting interpolated extended charclass in regex; marked by <--
2296+
HERE in m/%s/
23122297

2313-
we expected to see a close paren ')' (marked by ^) but did not.
2298+
(F) It looked like you were attempting to interpolate an
2299+
already-compiled extended character class, like so:
23142300

2315-
=item Expecting '(?flags:(?[...' in regex; marked by S<<-- HERE> in m/%s/
2301+
my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
2302+
...
2303+
qr/(?[ \p{Digit} & $thai_or_lao ])/;
23162304

2317-
(F) The C<(?[...])> extended character class regular expression construct
2318-
only allows character classes (including character class escapes like
2319-
C<\d>), operators, and parentheses. The one exception is C<(?flags:...)>
2320-
containing at least one flag and exactly one C<(?[...])> construct.
2321-
This allows a regular expression containing just C<(?[...])> to be
2322-
interpolated. If you see this error message, then you probably
2323-
have some other C<(?...)> construct inside your character class. See
2324-
L<perlrecharclass/Extended Bracketed Character Classes>.
2305+
But the marked code isn't syntactically correct to be such an
2306+
interpolated class.
23252307

23262308
=item Experimental aliasing via reference not enabled
23272309

proto.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5715,6 +5715,9 @@ STATIC regnode_offset S_regnode_guts(pTHX_ RExC_state_t *pRExC_state, const U8 o
57155715
STATIC regnode_offset S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth);
57165716
#define PERL_ARGS_ASSERT_REGPIECE \
57175717
assert(pRExC_state); assert(flagp)
5718+
STATIC regnode_offset S_regpnode(pTHX_ RExC_state_t *pRExC_state, U8 op, void * arg);
5719+
#define PERL_ARGS_ASSERT_REGPNODE \
5720+
assert(pRExC_state); assert(arg)
57185721
STATIC bool S_regtail(pTHX_ RExC_state_t * pRExC_state, const regnode_offset p, const regnode_offset val, const U32 depth)
57195722
__attribute__warn_unused_result__;
57205723
#define PERL_ARGS_ASSERT_REGTAIL \

regcomp.c

Lines changed: 58 additions & 51 deletions
Original file line numberDiff line numberDiff line change
@@ -142,6 +142,8 @@ struct RExC_state_t {
142142
U32 seen;
143143
SSize_t size; /* Number of regnode equivalents in
144144
pattern */
145+
Size_t sets_depth; /* Counts recursion depth of already-
146+
compiled regex set patterns */
145147

146148
/* position beyond 'precomp' of the warning message furthest away from
147149
* 'precomp'. During the parse, no warnings are raised for any problems
@@ -266,6 +268,7 @@ struct RExC_state_t {
266268
#define RExC_paren_names (pRExC_state->paren_names)
267269
#define RExC_recurse (pRExC_state->recurse)
268270
#define RExC_recurse_count (pRExC_state->recurse_count)
271+
#define RExC_sets_depth (pRExC_state->sets_depth)
269272
#define RExC_study_chunk_recursed (pRExC_state->study_chunk_recursed)
270273
#define RExC_study_chunk_recursed_bytes \
271274
(pRExC_state->study_chunk_recursed_bytes)
@@ -6421,6 +6424,11 @@ Perl_re_printf( aTHX_ "LHS=%" UVuf " RHS=%" UVuf "\n",
64216424
if (trie->jump) /* no more substrings -- for now /grr*/
64226425
flags &= ~SCF_DO_SUBSTR;
64236426
}
6427+
else if (OP(scan) == REGEX_SET) {
6428+
Perl_croak(aTHX_ "panic: %s regnode should be resolved"
6429+
" before optimization", reg_name[REGEX_SET]);
6430+
}
6431+
64246432
#endif /* old or new */
64256433
#endif /* TRIE_STUDY_OPT */
64266434

@@ -7670,6 +7678,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
76707678
RExC_study_chunk_recursed = NULL;
76717679
RExC_study_chunk_recursed_bytes= 0;
76727680
RExC_recurse_count = 0;
7681+
RExC_sets_depth = 0;
76737682
pRExC_state->code_index = 0;
76747683

76757684
/* Initialize the string in the compiled pattern. This is so that there is
@@ -16229,6 +16238,9 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
1622916238
&& UCHARAT(RExC_parse + 1) == '?'
1623016239
&& UCHARAT(RExC_parse + 2) == '^')
1623116240
{
16241+
const regnode_offset orig_emit = RExC_emit;
16242+
SV * resultant_invlist;
16243+
1623216244
/* If is a '(?^', could be an embedded '(?^flags:(?[...])'.
1623316245
* This happens when we have some thing like
1623416246
*
@@ -16238,62 +16250,33 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
1623816250
*
1623916251
* Here we would be handling the interpolated
1624016252
* '$thai_or_lao'. We handle this by a recursive call to
16241-
* ourselves which returns the inversion list the
16242-
* interpolated expression evaluates to. We use the flags
16243-
* from the interpolated pattern. */
16244-
U32 save_flags = RExC_flags;
16245-
const char * save_parse;
16246-
16247-
RExC_parse += 2; /* Skip past the '(?' */
16248-
save_parse = RExC_parse;
16249-
16250-
/* Parse the flags for the '(?'. We already know the first
16251-
* flag to parse is a '^' */
16252-
parse_lparen_question_flags(pRExC_state);
16253-
16254-
if ( RExC_parse >= RExC_end - 4
16255-
|| UCHARAT(RExC_parse) != ':'
16256-
|| UCHARAT(++RExC_parse) != '('
16257-
|| UCHARAT(++RExC_parse) != '?'
16258-
|| UCHARAT(++RExC_parse) != '[')
16259-
{
16253+
* reg which returns the inversion list the
16254+
* interpolated expression evaluates to. Actually, the
16255+
* return is a special regnode containing a pointer to that
16256+
* inversion list. If the return isn't that regnode alone,
16257+
* we know that this wasn't such an interpolation, which is
16258+
* an error: we need to get a single inversion list back
16259+
* from the recursion */
1626016260

16261-
/* In combination with the above, this moves the
16262-
* pointer to the point just after the first erroneous
16263-
* character. */
16264-
if (RExC_parse >= RExC_end - 4) {
16265-
RExC_parse = RExC_end;
16266-
}
16267-
else if (RExC_parse != save_parse) {
16268-
RExC_parse += (UTF)
16269-
? UTF8_SAFE_SKIP(RExC_parse, RExC_end)
16270-
: 1;
16271-
}
16272-
vFAIL("Expecting '(?flags:(?[...'");
16273-
}
16274-
16275-
/* Recurse, with the meat of the embedded expression */
1627616261
RExC_parse++;
16277-
if (! handle_regex_sets(pRExC_state, &current, flagp,
16278-
depth+1, oregcomp_parse))
16279-
{
16280-
RETURN_FAIL_ON_RESTART(*flagp, flagp);
16281-
}
16262+
RExC_sets_depth++;
1628216263

16283-
/* Here, 'current' contains the embedded expression's
16284-
* inversion list, and RExC_parse points to the trailing
16285-
* ']'; the next character should be the ')' */
16286-
RExC_parse++;
16287-
if (UCHARAT(RExC_parse) != ')')
16288-
vFAIL("Expecting close paren for nested extended charclass");
16264+
node = reg(pRExC_state, 2, flagp, depth+1);
16265+
RETURN_FAIL_ON_RESTART(*flagp, flagp);
1628916266

16290-
/* Then the ')' matching the original '(' handled by this
16291-
* case: statement */
16292-
RExC_parse++;
16293-
if (UCHARAT(RExC_parse) != ')')
16294-
vFAIL("Expecting close paren for wrapper for nested extended charclass");
16267+
if ( OP(REGNODE_p(node)) != REGEX_SET
16268+
/* If more than a single node returned, the nested
16269+
* parens evaluated to more than just a (?[...]),
16270+
* which isn't legal */
16271+
|| node != 1) {
16272+
vFAIL("Expecting interpolated extended charclass");
16273+
}
16274+
resultant_invlist = (SV *) ARGp(REGNODE_p(node));
16275+
current = invlist_clone(resultant_invlist, NULL);
16276+
SvREFCNT_dec(resultant_invlist);
1629516277

16296-
RExC_flags = save_flags;
16278+
RExC_sets_depth--;
16279+
RExC_emit = orig_emit;
1629716280
goto handle_operand;
1629816281
}
1629916282

@@ -16681,6 +16664,13 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
1668116664
return END;
1668216665
}
1668316666

16667+
if (RExC_sets_depth) { /* If within a recursive call, return in a special
16668+
regnode */
16669+
RExC_parse++;
16670+
node = regpnode(pRExC_state, REGEX_SET, (void *) final);
16671+
}
16672+
else {
16673+
1668416674
/* Otherwise generate a resultant node, based on 'final'. regclass() is
1668516675
* expecting a string of ranges and individual code points */
1668616676
invlist_iterinit(final);
@@ -16764,6 +16754,7 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
1676416754
ANYOF_FLAGS(REGNODE_p(node))
1676516755
|= ANYOFL_SHARED_UTF8_LOCALE_fold_HAS_MATCHES_nonfold_REQD;
1676616756
}
16757+
}
1676716758

1676816759
nextchar(pRExC_state);
1676916760
Set_Node_Length(REGNODE_p(node), RExC_parse - oregcomp_parse + 1); /* MJD */
@@ -20216,6 +20207,22 @@ S_reganode(pTHX_ RExC_state_t *pRExC_state, U8 op, U32 arg)
2021620207
return(ret);
2021720208
}
2021820209

20210+
/*
20211+
- regpnode - emit a temporary node with a void* argument
20212+
*/
20213+
STATIC regnode_offset /* Location. */
20214+
S_regpnode(pTHX_ RExC_state_t *pRExC_state, U8 op, void * arg)
20215+
{
20216+
const regnode_offset ret = regnode_guts(pRExC_state, op, regarglen[op], "regvnode");
20217+
regnode_offset ptr = ret;
20218+
20219+
PERL_ARGS_ASSERT_REGPNODE;
20220+
20221+
FILL_ADVANCE_NODE_ARGp(ptr, op, arg);
20222+
RExC_emit = ptr;
20223+
return(ret);
20224+
}
20225+
2021920226
STATIC regnode_offset
2022020227
S_reg2Lanode(pTHX_ RExC_state_t *pRExC_state, const U8 op, const U32 arg1, const I32 arg2)
2022120228
{

t/re/reg_mesg.t

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -311,7 +311,7 @@ my @death =
311311
'/\p{Latin}{,4 }/' => 'Unescaped left brace in regex is illegal here {#} m/\p{Latin}{{#},4 }/',
312312
'/(?<=/' => 'Sequence (?... not terminated {#} m/(?<={#}/', # [perl #128170]
313313
'/\p{vertical tab}/' => 'Can\'t find Unicode property definition "vertical tab" {#} m/\\p{vertical tab}{#}/', # [perl #132055]
314-
"/$bug133423/" => "Operand with no preceding operator {#} m/(?[(?^:(?[\\
314+
"/$bug133423/" => "Unexpected ']' with no following ')' in (?[... {#} m/(?[(?^:(?[\\
315315
'/[^/' => 'Unmatched [ {#} m/[{#}^/', # [perl #133767]
316316
'/\p{Is_Other_Alphabetic=F}/ ' => 'Can\'t find Unicode property definition "Is_Other_Alphabetic=F" {#} m/\p{Is_Other_Alphabetic=F}{#}/',
317317
'/\p{Is_Other_Alphabetic=F}/ ' => 'Can\'t find Unicode property definition "Is_Other_Alphabetic=F" {#} m/\p{Is_Other_Alphabetic=F}{#}/',

t/re/regex_sets.t

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -220,6 +220,12 @@ for my $char ("٠", "٥", "٩") {
220220
qr/(?[\P{Is0}])/', qr/\QUnknown user-defined property name "Is0"/, {}, "[perl #133889]");
221221
}
222222

223+
{
224+
my $s = qr/(?x:(?[ [ x ] ]))/;
225+
like("x", qr/(?[ $s ])/ , "Modifier flags in interpolated set don't"
226+
. " disrupt");
227+
}
228+
223229
done_testing();
224230

225231
1;

0 commit comments

Comments
 (0)