Description
https://www.php.net/manual/en/regexp.reference.unicode.php states:
That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.
This is false, because of this commit:
As per https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes:
By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, although this may vary for characters in the range 128-255 when locale-specific matching is happening. These escape sequences retain their original meanings from before Unicode support was available, mainly for efficiency reasons. If PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:
  \d  any character that matches \p{Nd} (decimal digit)
  \s  any character that matches \p{Z} or \h or \v
  \w  any character that matches \p{L} or \p{N}, plus underscore
The upper case escapes match the inverse sets of characters. Note that \d matches only decimal digits, whereas \w matches any Unicode digit, as well as any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and \B because they are defined in terms of \w and \W. Matching these sequences is noticeably slower when PCRE_UCP is set.
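A minimal PHP sketch of the behaviour quoted above may help when rewording the manual page. It assumes that the `u` pattern modifier is what turns on PCRE_UCP in PHP (which appears to be what the referenced commit changed) and that PCRE was built with Unicode property support. The two sample characters are illustrative choices, not taken from the manual.

```php
<?php
// Illustrative sample characters (assumptions, not taken from the manual):
$digit  = "\u{0663}"; // ARABIC-INDIC DIGIT THREE, a member of \p{Nd}
$letter = "\u{00E9}"; // LATIN SMALL LETTER E WITH ACUTE, a member of \p{L}

// With the u modifier, \d and \w are expected to follow Unicode properties.
$digitWithU  = preg_match('/^\d$/u', $digit);
$letterWithU = preg_match('/^\w$/u', $letter);

// Without the u modifier the subject is matched byte by byte, and in the
// default "C" locale bytes above 127 do not count as digits.
$digitWithoutU = preg_match('/\d/', $digit);

var_dump($digitWithU, $letterWithU, $digitWithoutU);
// prints int(1), int(1), int(0)
```

If the first two results are 1, then `\d` and `\w` are using Unicode properties under the `u` modifier, which contradicts the sentence quoted from the manual.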