pcre2test: tighten \x{...} parsing in subject #504

Merged
merged 1 commit into from
Oct 2, 2024
12 changes: 9 additions & 3 deletions doc/pcre2compat.3
@@ -226,9 +226,15 @@ handled by PCRE2, either by the interpreter or the JIT. An example is
 /(?:|(?0)abcd)(?(R)|\ez)/, which matches a sequence of any number of repeated
 "abcd" substrings at the end of the subject.
 .P
-23. From release 10.45, PCRE2 gives an error if \ex is not followed by a
-hexadecimal digit or a curly bracket. It used to interpret this as the NUL
-character. Perl still generates NUL, but warns in its warning mode.
+23. Both PCRE2 and Perl error when \ex{ escapes are invalid, but Perl tries to
+recover and prints a warning if the problem was that an invalid hexadecimal
+digit was found, since PCRE2 doesn't have warnings it returns an error instead.
+Additionally, Perl accepts \ex{} and generates NUL unlike PCRE2.
+.P
+24. From release 10.45, PCRE2 gives an error if \ex is not followed by a
+hexadecimal digit or a curly bracket. It used to interpret this as the NUL
+character. Perl still generates NUL, but warns when in warning mode in most
+cases.
 .
 .
 .SH AUTHOR
2 changes: 1 addition & 1 deletion doc/pcre2test.1
@@ -516,7 +516,7 @@ this makes it possible to construct invalid UTF-8 sequences for testing
 purposes. On the other hand, \ex{hh} is interpreted as a UTF-8 character in
 UTF-8 mode, generating more than one byte if the value is greater than 127.
 When testing the 8-bit library not in UTF-8 mode, \ex{hh} generates one byte
-for values less than 256, and causes an error for greater values.
+for values that could fit on it, and causes an error for greater values.
 .P
 In UTF-16 mode, all 4-digit \ex{hhhh} values are accepted. This makes it
 possible to construct invalid UTF-16 sequences for testing purposes.
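
For reviewers who want the 8-bit rule from this hunk in concrete form, here is a minimal sketch in C. The helper `emit_8bit` is hypothetical and is not taken from pcre2test.c. It assumes UTF-8 output stops at U+10FFFF, which is a simplification of this sketch rather than a statement about what pcre2test accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from pcre2test.c): store the value of a \x{...}
   item into an 8-bit subject buffer of at least 4 bytes. Returns the number
   of bytes written, or -1 when the value cannot be represented.
   Assumption of this sketch: UTF-8 output stops at U+10FFFF. */
static int emit_8bit(uint32_t c, int utf, uint8_t *out)
{
  assert(out != NULL);

  if (!utf)
    {
    /* Not in UTF-8 mode: the value must fit in a single byte. */
    if (c > 0xff) return -1;
    out[0] = (uint8_t)c;
    return 1;
    }

  /* UTF-8 mode: values above 127 need more than one byte. */
  if (c < 0x80)
    {
    out[0] = (uint8_t)c;
    return 1;
    }
  if (c < 0x800)
    {
    out[0] = (uint8_t)(0xc0 | (c >> 6));
    out[1] = (uint8_t)(0x80 | (c & 0x3f));
    return 2;
    }
  if (c < 0x10000)
    {
    out[0] = (uint8_t)(0xe0 | (c >> 12));
    out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
    out[2] = (uint8_t)(0x80 | (c & 0x3f));
    return 3;
    }
  if (c <= 0x10ffff)
    {
    out[0] = (uint8_t)(0xf0 | (c >> 18));
    out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
    out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
    out[3] = (uint8_t)(0x80 | (c & 0x3f));
    return 4;
    }
  return -1;
}
```

With this sketch, `\x{100}` in UTF-8 mode becomes the two bytes 0xc4 0x80, which agrees with the "First code unit = \xc4" and "Last code unit = \x80" lines in testoutput10, while the same value outside UTF-8 mode is rejected.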
8 changes: 7 additions & 1 deletion perltest.sh
@@ -314,7 +314,13 @@ for (;;)
 }
 else
 {
-$x = eval "\"$_\""; # To get escapes processed
+s/(?<!\\)\\$//; # Remove pcre2test specific trailing backslash
+$x = eval "\"$_\""; # To get escapes processed
+if ($interact && $@)
+{
+print STDERR "$@";
+redo;
+}
 }

 # Empty array for holding results, ensure $REGERROR and $REGMARK are
54 changes: 28 additions & 26 deletions src/pcre2test.c
@@ -7174,10 +7174,10 @@ while ((c = *p++) != 0)
 break;

 case 'x':
+c = 0;
 if (*p == '{')
 {
 uint8_t *pt = p;
-c = 0;

 /* We used to have "while (isxdigit(*(++pt)))" here, but it fails
 when isxdigit() is a macro that refers to its argument more than
@@ -7187,36 +7187,41 @@ while ((c = *p++) != 0)
 for (pt++; isxdigit(*pt); pt++)
 {
 if (++i == 9)
+{
 fprintf(outfile, "** Too many hex digits in \\x{...} item; "
 "using only the first eight.\n");
-else c = c * 16 + (tolower(*pt) - ((isdigit(*pt))? '0' : 'a' - 10));
+while (isxdigit(*pt)) pt++;
+break;
+}
+else c = c * 16 + (tolower(*pt) - (isdigit(*pt)? '0' : 'a' - 10));
 }
-if (*pt == '}')
+if (i == 0 || *pt != '}')
 {
-p = pt + 1;
-break;
+fprintf(outfile, "** Malformed \\x{ escape\n");
+return PR_OK;
 }
-/* Not correct form for \x{...}; fall through */
+else p = pt + 1;
 }
-
-/* \x without {} always defines just one byte in 8-bit mode. This
-allows UTF-8 characters to be constructed byte by byte, and also allows
-invalid UTF-8 sequences to be made. Just copy the byte in UTF-8 mode.
-Otherwise, pass it down as data. */
-
-c = 0;
-while (i++ < 2 && isxdigit(*p))
+else
 {
-c = c * 16 + (tolower(*p) - ((isdigit(*p))? '0' : 'a' - 10));
-p++;
-}
+/* \x without {} always defines just one byte in 8-bit mode. This
+allows UTF-8 characters to be constructed byte by byte, and also allows
+invalid UTF-8 sequences to be made. Just copy the byte in UTF-8 mode.
+Otherwise, pass it down as data. */
+
+while (i++ < 2 && isxdigit(*p))
+{
+c = c * 16 + (tolower(*p) - (isdigit(*p)? '0' : 'a' - 10));
+p++;
+}
 #if defined SUPPORT_PCRE2_8
-if (utf && (test_mode == PCRE8_MODE))
-{
-*q8++ = c;
-continue;
-}
+if (utf && (test_mode == PCRE8_MODE))
+{
+*q8++ = c;
+continue;
+}
 #endif
+}
 break;

 case 0: /* \ followed by EOF allows for an empty line */
@@ -7309,10 +7314,7 @@ while ((c = *p++) != 0)
 }
 #endif
 #ifdef SUPPORT_PCRE2_32
-if (test_mode == PCRE32_MODE)
-{
-*q32++ = c;
-}
+if (test_mode == PCRE32_MODE) *q32++ = c;
 #endif
 }
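
The control flow of the tightened `case 'x':` branch can be read in isolation with the following sketch. `parse_hex_escape` is a hypothetical standalone function written for illustration. It mirrors the logic in the hunk above (empty or unterminated `\x{` is malformed, at most eight digits are used, plain `\x` takes up to two digits) but returns -1 where pcre2test prints a message and returns `PR_OK`, and it omits the "too many hex digits" warning.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of pcre2test): parse the text that follows
   "\x" in a subject line. On success, stores the value in *value and returns
   the number of bytes consumed; returns -1 for a malformed \x{...} item. */
static int parse_hex_escape(const unsigned char *p, uint32_t *value)
{
  const unsigned char *start = p;
  uint32_t c = 0;
  int i = 0;

  assert(p != NULL && value != NULL);

  if (*p == '{')
    {
    for (p++; isxdigit(*p); p++)
      {
      if (++i == 9)
        {
        /* Too many digits: keep the first eight and skip the rest. */
        while (isxdigit(*p)) p++;
        break;
        }
      c = c * 16 + (uint32_t)(tolower(*p) - (isdigit(*p) ? '0' : 'a' - 10));
      }

    /* New behaviour: no digits at all, or no closing brace, is an error
       instead of falling through to the two-digit form. */
    if (i == 0 || *p != '}') return -1;
    p++;
    }
  else
    {
    /* \x without braces takes at most two hex digits, possibly none. */
    while (i++ < 2 && isxdigit(*p))
      {
      c = c * 16 + (uint32_t)(tolower(*p) - (isdigit(*p) ? '0' : 'a' - 10));
      p++;
      }
    }

  *value = c;
  return (int)(p - start);
}
```

Under this sketch the subject fragment `{100\x{100}` from the removed test case is rejected, which is consistent with that case being dropped from testinput10 and testinput12.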

3 changes: 0 additions & 3 deletions testdata/testinput10
@@ -187,9 +187,6 @@
 \x{c0}
 \x{f0}

-/Ā{3,4}/IB,utf
-\x{100}\x{100}\x{100}\x{100\x{100}
-
 /(\x{100}+|x)/IB,utf

 /(\x{100}*a|x)/IB,utf
3 changes: 0 additions & 3 deletions testdata/testinput12
@@ -56,9 +56,6 @@
 \x{c0}
 \x{f0}

-/Ā{3,4}/IB,utf
-\x{100}\x{100}\x{100}\x{100\x{100}
-
 /(\x{100}+|x)/IB,utf

 /(\x{100}*a|x)/IB,utf
16 changes: 0 additions & 16 deletions testdata/testoutput10
@@ -492,22 +492,6 @@ No match
 \x{f0}
 No match

-/Ā{3,4}/IB,utf
-------------------------------------------------------------------
-Bra
-\x{100}{3}
-\x{100}?+
-Ket
-End
-------------------------------------------------------------------
-Capture group count = 0
-Options: utf
-First code unit = \xc4
-Last code unit = \x80
-Subject length lower bound = 3
-\x{100}\x{100}\x{100}\x{100\x{100}
-0: \x{100}\x{100}\x{100}
-
 /(\x{100}+|x)/IB,utf
 ------------------------------------------------------------------
 Bra
16 changes: 0 additions & 16 deletions testdata/testoutput12-16
@@ -273,22 +273,6 @@ No match
 \x{f0}
 No match

-/Ā{3,4}/IB,utf
-------------------------------------------------------------------
-Bra
-\x{100}{3}
-\x{100}?+
-Ket
-End
-------------------------------------------------------------------
-Capture group count = 0
-Options: utf
-First code unit = \x{100}
-Last code unit = \x{100}
-Subject length lower bound = 3
-\x{100}\x{100}\x{100}\x{100\x{100}
-0: \x{100}\x{100}\x{100}
-
 /(\x{100}+|x)/IB,utf
 ------------------------------------------------------------------
 Bra
16 changes: 0 additions & 16 deletions testdata/testoutput12-32
@@ -268,22 +268,6 @@ No match
 \x{f0}
 No match

-/Ā{3,4}/IB,utf
-------------------------------------------------------------------
-Bra
-\x{100}{3}
-\x{100}?+
-Ket
-End
-------------------------------------------------------------------
-Capture group count = 0
-Options: utf
-First code unit = \x{100}
-Last code unit = \x{100}
-Subject length lower bound = 3
-\x{100}\x{100}\x{100}\x{100\x{100}
-0: \x{100}\x{100}\x{100}
-
 /(\x{100}+|x)/IB,utf
 ------------------------------------------------------------------
 Bra