PCRE Unicode reference contains incorrect statement #2831

@TimWolla

Description

https://www.php.net/manual/en/regexp.reference.unicode.php states:

That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

This is false, because of this commit:

php/php-src@87a2373

As per https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes:

By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, although this may vary for characters in the range 128-255 when locale-specific matching is happening. These escape sequences retain their original meanings from before Unicode support was available, mainly for efficiency reasons. If PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

\d  any character that matches \p{Nd} (decimal digit)
\s  any character that matches \p{Z} or \h or \v
\w  any character that matches \p{L} or \p{N}, plus underscore

The upper case escapes match the inverse sets of characters. Note that \d matches only decimal digits, whereas \w matches any Unicode digit, as well as any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and \B because they are defined in terms of \w and \W. Matching these sequences is noticeably slower when PCRE_UCP is set.
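
The difference can be observed from PHP itself. The following is a minimal sketch, not taken from the manual or from the commit: it assumes PHP 7.0 or later (for the "\u{...}" string escape) and a build where the u modifier also turns on PCRE_UCP, which is what the commit referenced above is understood to do. The helper name matches() and the sample characters are illustrative choices only.

```php
<?php
// Hypothetical helper (not part of PHP): true when $pattern matches $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Sample characters chosen for illustration.
$arabicIndicThree = "\u{0663}"; // U+0663 ARABIC-INDIC DIGIT THREE, general category Nd
$eAcute           = "\u{00E9}"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE, general category Ll

$results = [
    // Without u the subject is treated as bytes; each character above is
    // two bytes in UTF-8, so a single \d or \w cannot span it.
    'digit without u' => matches('/^\d$/', $arabicIndicThree),
    'word without u'  => matches('/^\w$/', $eAcute),

    // With u the subject is treated as UTF-8. If PCRE_UCP is active,
    // \d follows \p{Nd} and \w follows \p{L} or \p{N} plus underscore.
    'digit with u'    => matches('/^\d$/u', $arabicIndicThree),
    'word with u'     => matches('/^\w$/u', $eAcute),
];

var_dump($results);
```

If the assumption holds, the two entries without u should be false and the two entries with u should be true, which contradicts the sentence quoted from the manual.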