re: documentation claim that special characters lose their special meaning inside […] seems wrong #106482

Open
@calestyo

Description

Documentation

The claim at:

cpython/Doc/library/re.rst

Lines 253 to 255 in d0c6ba9

* Special characters lose their special meaning inside sets. For example,
``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
``'*'``, or ``')'``.

seems wrong, at least for \.

Consider the following example:

>>> import re
>>> bool(re.search(string=b"a\\b", pattern=b"[\\\n\r]"))
False

My expectation would be that, after backslash-unescaping the b"…" literal, pattern holds the following sequence:

literal \, the line-feed "character", the carriage-return "character"

If it were true that "Special characters lose their special meaning inside sets.", then the resolved \ in the unescaped pattern should match the one in my test string b"a\\b". However, it does not.

I guess what Python actually "sees" is:

backslash-escaped line-feed "character", the carriage-return "character"

which probably effectively yields:

the line-feed "character", the carriage-return "character"
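
This reading can be checked with a short script. It is only a sketch, assuming a current CPython 3 `re` module; the variable names are mine and chosen purely for illustration:

```python
import re

# After the bytes literal is unescaped, the pattern is 5 bytes: [ \ LF CR ]
pattern = b"[\\\n\r]"
pattern_length = len(pattern)

# The backslash in the test string is not matched by the set ...
matches_backslash = bool(re.search(pattern, b"a\\b"))
# ... but a line feed and a carriage return are.
matches_lf = bool(re.search(pattern, b"a\nb"))
matches_cr = bool(re.search(pattern, b"a\rb"))

# Doubling the backslash at the regex level (6 bytes: [ \ \ LF CR ])
# makes the set contain a literal backslash as well.
escaped_pattern = b"[\\\\\n\r]"
matches_backslash_escaped = bool(re.search(escaped_pattern, b"a\\b"))

print(pattern_length, matches_backslash, matches_lf, matches_cr,
      matches_backslash_escaped)
```

So inside the set, the \ consumes the following line feed as an escape instead of standing for itself.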

Now you could argue that \ is not considered a special character in terms of the regular expression syntax... but it is, if only because of:

cpython/Doc/library/re.rst

Lines 504 to 507 in d0c6ba9

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

and the lines that follow.

Also, even the section that explains […] mentions that \ can be used for escaping inside it:

cpython/Doc/library/re.rst

Lines 249 to 250 in d0c6ba9

``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
``[a\-z]``) or if it's placed as the first or last character

I think:

cpython/Doc/library/re.rst

Lines 253 to 255 in d0c6ba9

* Special characters lose their special meaning inside sets. For example,
``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
``'*'``, or ``')'``.

should be improved to document that:

  • \ is exempt from this
  • whether this is only the case for characters that are actually special within the RE bracket expression, i.e. [0\-9] is 0, - and 9, because the - was special in that position. But what about [\-9]? Here, the - would not have been special, so is the result \, - and 9, or just - and 9?
  • or whether this is simply the case for any character following the \ ... ones that are special outside an RE bracket expression, like \$, \D, \w or \number... and/or ones that are never special, like .
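
For reference, the behaviour can be probed empirically with something like the sketch below. It assumes a current CPython 3; `set_members` and the candidate string are hypothetical helpers of mine, not part of `re`. It only shows what the implementation does, not what the documentation promises:

```python
import re

# Hypothetical helper: which of the candidate characters does a set match?
def set_members(pattern, candidates="\\-09$aZ_ "):
    """Return the candidate characters that fully match the given pattern."""
    return [ch for ch in candidates if re.fullmatch(pattern, ch)]

escaped_mid_dash = set_members(r"[0\-9]")   # '-' escaped between two characters
escaped_lead_dash = set_members(r"[\-9]")   # '-' escaped in first position
escaped_dollar = set_members(r"[\$]")       # '$' is special only outside sets
word_class = set_members(r"[\w]")           # class escape inside a set
escaped_backslash = set_members(r"[\\]")    # escaped backslash

for name, members in [
    (r"[0\-9]", escaped_mid_dash),
    (r"[\-9]", escaped_lead_dash),
    (r"[\$]", escaped_dollar),
    (r"[\w]", word_class),
    (r"[\\]", escaped_backslash),
]:
    print(name, members)
```

In this sketch the backslash never ends up as a member of the set unless it is itself escaped, and `\w` keeps its class meaning inside the brackets.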

Thanks,
Chris.
