Complementary re patterns such as [\s\S] or [\w\W] are much slower than . with DOTALL  #111259

Closed
@pan324

Description

Bug report

Bug description:

import re
from time import perf_counter

p1 = re.compile(r"[\s\S]*")        # complementary character class
p2 = re.compile(r".*", re.DOTALL)  # dot with DOTALL

s = "a" * 10000
for p in (p1, p2):
    t0 = perf_counter()
    for _ in range(10000):
        p.match(s)
    print(perf_counter() - t0)

Runtimes are 0.44 s for [\s\S]* vs 0.0016 s for .* with DOTALL on my system. Instead of being simplified to "match any character", the set [\s\S] is evaluated for every character, one member after another: \s does not match "a", so \S is checked next (the order [\S\s] is twice as fast for this string). This is not solely an issue for long matches: a 40-character string is processed half as fast with [\s\S], and even 10 characters take about 25% longer.

I'm not sure whether this qualifies as a bug or as a documentation issue. Some other languages don't have a DOTALL option and have to rely on the first form, so plenty of posts on SO and elsewhere advocate [\s\S] as a match-anything pattern. Unsuspecting Python programmers such as @barneygale may expect [\s\S] to perform identically to a dot with DOTALL, as in the pathlib snippet quoted below.
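To see how the gap scales with input length, the measurement can be repeated with `timeit` for several sizes. This is only a sketch: the helper name `compare`, the chosen lengths, and the repetition count are arbitrary choices for illustration, and absolute numbers will differ across machines and CPython versions.

```python
import re
import timeit

# Patterns under comparison; all three should match any string in full.
PATTERNS = {
    r"[\s\S]*": re.compile(r"[\s\S]*"),
    r"[\S\s]*": re.compile(r"[\S\s]*"),
    ".* (DOTALL)": re.compile(r".*", re.DOTALL),
}


def compare(length, number=1000):
    """Return {label: seconds} for `number` full matches of "a" * length."""
    s = "a" * length
    timings = {}
    for label, pattern in PATTERNS.items():
        # Sanity check: every pattern consumes the whole string.
        assert pattern.match(s).end() == length
        timings[label] = timeit.timeit(lambda: pattern.match(s), number=number)
    return timings


if __name__ == "__main__":
    for length in (10, 40, 10000):
        for label, seconds in compare(length).items():
            print(f"{length:>6} chars  {label:<12} {seconds:.6f} s")
```

The built-in assertion confirms that the three patterns are interchangeable in what they match, so any difference in the printed timings comes from how the pattern is evaluated rather than from what it matches.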

@serhiy-storchaka

cpython/Lib/pathlib.py

Lines 126 to 133 in 9bb202a

elif part == '**\n':
    # '**/' component: we use '[\s\S]' rather than '.' so that path
    # separators (i.e. newlines) are matched. The trailing '^' ensures
    # we terminate after a path separator (i.e. on a new line).
    part = r'[\s\S]*^'
elif part == '**':
    # '**' component.
    part = r'[\s\S]*'
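For comparison, the same two components can be written with a scoped inline flag, `(?s:.)`, which enables DOTALL for just that part of the expression (available since Python 3.6). The sketch below checks that this form matches the same span as `[\s\S]` on a few sample strings. It assumes, as the comment in the snippet implies, that the final pattern is compiled with `re.MULTILINE` so that `^` can match after a newline. The names `PAIRS`, `SAMPLES`, and `same_matches` are hypothetical, and this is an illustration rather than a claim about what pathlib should do.

```python
import re

# Assumption: the surrounding pattern is compiled with re.MULTILINE,
# so '^' can match right after a newline (the path separator here).
FLAGS = re.MULTILINE

# (character-class form, scoped-DOTALL form)
PAIRS = [
    (r"[\s\S]*^", r"(?s:.)*^"),  # '**/' component
    (r"[\s\S]*", r"(?s:.)*"),    # '**' component
]

SAMPLES = ["", "a", "a\nb\n", "a\nb\nc", "\n\n", "no newline at all"]


def same_matches(old, new, text):
    """True if both patterns match the same span at the start of `text`."""
    m_old = re.compile(old, FLAGS).match(text)
    m_new = re.compile(new, FLAGS).match(text)
    if m_old is None or m_new is None:
        return m_old is m_new
    return m_old.span() == m_new.span()


if __name__ == "__main__":
    for old, new in PAIRS:
        print(old, new, all(same_matches(old, new, s) for s in SAMPLES))
```

Scoping the flag keeps the rest of the compiled pattern unaffected, which matters when other components rely on `.` not matching a newline.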

CPython versions tested on:

3.11, 3.13

Operating systems tested on:

Linux, Windows
