Commit 9d53c45

Author: Karl Williamson
PATCH: [perl #89774] multi-char fold + its fold in char class
The design for handling characters that fold to multiple characters, when such characters are encountered in a bracketed character class, is defective. The ticket reads, "If a bracketed character class includes a character that has a multi-char fold, and it also includes the first character of that fold, the multi-char fold will never be matched; just the first character of the fold." Thus, in the class /[\0-\xff]/i, the multi-char fold of \xDF will never be matched, because that fold is 'ss', the first character of which, 's', is also in the class. The design is defective because it doesn't allow for backtracking and trying the other options. This commit solves this by effectively rewriting the above to be / (?: \xdf | [\0-\xde\xe0-\xff] ) /xi, so the backtracking gets handled automatically by the regex engine.
1 parent 9ffebac commit 9d53c45
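
The backtracking point in the commit message can be illustrated with a small Perl sketch. It is only an analogy, not the engine's actual implementation: the variable names are hypothetical, plain 's' stands in for the first character of the fold, and 'ss' stands in for the full multi-char fold. An atomic group is used to imitate a design that commits to the first option and never retries.

```perl
use strict;
use warnings;

# Illustrative model only (an assumption, not the regex engine's internals):
# 's' plays the first character of the fold, 'ss' the full multi-char fold.

# Defective design: once the single 's' is taken, the longer option is never
# retried. An atomic group (?>...) has the same no-backtracking behavior.
my $no_backtrack = qr/\A(?>s|ss)\z/;

# Design after the fix: an ordinary alternation lets the engine back up
# and try 'ss' when the rest of the pattern fails after a lone 's'.
my $backtrack = qr/\A(?:s|ss)\z/;

for my $text ('s', 'ss') {
    printf "%-2s  atomic: %-3s  alternation: %s\n",
        $text,
        ($text =~ $no_backtrack ? 'yes' : 'no'),
        ($text =~ $backtrack    ? 'yes' : 'no');
}
```

In this model, 'ss' fails against the atomic version, because the group commits to the lone 's' and \z then fails, but it matches the plain alternation, which is the behavior the rewrite into (?: ... | [...] ) relies on.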

8 files changed, 247 additions and 35 deletions

embedvar.h

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@
 #define PL_DBtrace (vTHX->IDBtrace)
 #define PL_Dir (vTHX->IDir)
 #define PL_Env (vTHX->IEnv)
+#define PL_HasMultiCharFold (vTHX->IHasMultiCharFold)
 #define PL_L1Cased (vTHX->IL1Cased)
 #define PL_L1PosixAlnum (vTHX->IL1PosixAlnum)
 #define PL_L1PosixAlpha (vTHX->IL1PosixAlpha)

intrpvar.h

Lines changed: 1 addition & 0 deletions
@@ -609,6 +609,7 @@ PERLVAR(I, XPosixXDigit, SV *)
 PERLVAR(I, VertSpace, SV *)
 
 PERLVAR(I, NonL1NonFinalFold, SV *)
+PERLVAR(I, HasMultiCharFold, SV *)
 
 /* utf8 character class swashes */
 PERLVAR(I, utf8_alnum, SV *)

pod/perldelta.pod

Lines changed: 27 additions & 1 deletion
@@ -53,6 +53,30 @@ XXX For a release on a stable branch, this section aspires to be:
 
 [ List each incompatible change as a =head2 entry ]
 
+=head2 New Restrictions in Multi-Character Case-Insensitive Matching in Regular Expression Bracketed Character Classes
+
+Unicode has now withdrawn their previous recommendation for regular
+expressions to automatically handle cases where a single character can
+match multiple characters case-insensitively; for example, the letter
+LATIN SMALL LETTER SHARP S and the sequence C<ss>. This is because
+it turns out to be impracticable to do this correctly in all
+circumstances. Because Perl has tried to do this as best it can, it
+will continue to do so. (We are considering an option to turn it off.)
+However, a new restriction is being added on such matches when they
+occur in [bracketed] character classes. People were specifying
+things such as C</[\0-\xff]/i>, and being surprised that it matches the
+two character sequence C<ss> (since LATIN SMALL LETTER SHARP S occurs in
+this range). This behavior is also inconsistent with the using a
+property instead of a range: C<\p{Block=Latin1}> also includes LATIN
+SMALL LETTER SHARP S, but C</[\p{Block=Latin1}]/i> does not match C<ss>.
+The new rule is that for there to be a multi-character case-insensitive
+match within a bracketed character class, the character must be
+explicitly listed, and not as an end point of a range. This more
+closely obeys the Principle of Least Astonishment. See
+L<perlrecharclass/Bracketed Character Classes>. Note that a bug [perl
+#89774], now fixed as part of this change, prevented the previous
+behavior from working fully.
+
 =head1 Deprecations
 
 XXX Any deprecated features, syntax, modules etc. should be listed here. In
@@ -315,7 +339,9 @@ well.
 
 =item *
 
-XXX
+Case-insensitive matching inside a [bracketed] character class with a
+multi-character fold, no longer excludes one of the possibilities in the
+circumstances that it used to. [perl #89774].
 
 =back

pod/perlre.pod

Lines changed: 5 additions & 19 deletions
@@ -72,27 +72,13 @@ are split between groupings, or when one or more are quantified. Thus
 # be even if it did!!
 "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
 
-Perl doesn't match multiple characters in an inverted bracketed
-character class, which otherwise could be highly confusing. See
+Perl doesn't match multiple characters in a bracketed
+character class unless the character that maps to them is explicitly
+mentioned, and it doesn't match them at all if the character class is
+inverted, which otherwise could be highly confusing. See
+L<perlrecharclass/Bracketed Character Classes>, and
 L<perlrecharclass/Negation>.
 
-Another bug involves character classes that match both a sequence of
-multiple characters, and an initial sub-string of that sequence. For
-example,
-
-/[s\xDF]/i
-
-should match both a single and a double "s", since C<\xDF> (on ASCII
-platforms) matches "ss". However, this bug
-(L<[perl #89774]|https://rt.perl.org/rt3/Ticket/Display.html?id=89774>)
-causes it to only match a single "s", even if the final larger match
-fails, and matching the double "ss" would have succeeded.
-
-Also, Perl matching doesn't fully conform to the current Unicode C</i>
-recommendations, which ask that the matching be made upon the NFD
-(Normalization Form Decomposed) of the text. However, Unicode is
-in the process of reconsidering and revising their recommendations.
-
 =item x
 X</x>

pod/perlrecharclass.pod

Lines changed: 14 additions & 1 deletion
@@ -441,7 +441,8 @@ Examples:
 
 * There is an exception to a bracketed character class matching a
 single character only. When the class is to match caselessly under C</i>
-matching rules, and a character inside the class matches a
+matching rules, and a character that is explicitly mentioned inside the
+class matches a
 multiple-character sequence caselessly under Unicode rules, the class
 (when not L<inverted|/Negation>) will also match that sequence. For
 example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
@@ -450,6 +451,18 @@ should match the sequence C<ss> under C</i> rules. Thus,
 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches
 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
 
+For this to happen, the character must be explicitly specified, and not
+be part of a multi-character range (not even as one of its endpoints).
+(L</Character Ranges> will be explained shortly.) Therefore,
+
+'ss' =~ /\A[\0-\x{ff}]\z/i # Doesn't match
+'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i # No match
+'ss' =~ /\A[\xDF-\xDF]\z/i # Matches on ASCII platforms, since \XDF
+# is LATIN SMALL LETTER SHARP S, and the
+# range is just a single element
+
+Note that it isn't a good idea to specify these types of ranges anyway.
+
 =head3 Special Characters Inside a Bracketed Character Class
 
 Most characters that are meta characters in regular expressions (that
