Complementary re patterns such as [\s\S] or [\w\W] are much slower than . with DOTALL #111259
Description
Bug report
Bug description:
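import re
from time import perf_counter as time

p1 = re.compile(r"[\s\S]*")        # complementary character set
p2 = re.compile(".*", re.DOTALL)   # dot that also matches newlines
s = "a" * 10000

for p in (p1, p2):
    t0 = time()
    for i in range(10000):
        _ = p.match(s)
    print(time() - t0)             # elapsed seconds for this pattern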
Runtimes are 0.44 s vs 0.0016 s on my system. Rather than being simplified to a match-anything set, the members of [\s\S] are tested one after another: \s does not match, so \S is checked next (the order [\S\s] is twice as fast for the string here). This is not solely an issue for long matches. A 40-char string takes about twice as long to process with [\s\S], and even 10 chars take about 25% longer.
I'm not completely sure whether this qualifies as a bug or as a documentation issue. Some other languages don't have a DOTALL option and have to rely on the first form, so plenty of posts on SO and elsewhere advocate [\s\S] as an all-matching regex pattern. Unsuspecting Python programmers such as @barneygale may expect [\s\S] to perform identically to a dot with DOTALL, as seen below.
Lines 126 to 133 in 9bb202a
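For anyone who wants to reproduce the comparison on their own interpreter, here is a minimal sketch. It checks that three spellings of "any character, including newline" match the same span and then times each one. The scoped-flag form (?s:.) is the one named in the linked documentation PRs. The names PATTERNS, sample and time_pattern are illustrative only and do not come from CPython or pathlib. Absolute timings depend on the machine and the CPython version, so no expected numbers are given.

```python
import re
import timeit

# Three spellings of "match any character, including newline".
PATTERNS = {
    "[\\s\\S]*": re.compile(r"[\s\S]*"),
    ".* with re.DOTALL": re.compile(r".*", re.DOTALL),
    "(?s:.)*": re.compile(r"(?s:.)*"),
}

# Sample text with embedded newlines (hypothetical test input).
sample = "line one\nline two\n" + "a" * 100

# Every pattern should consume the whole string, newlines included.
spans = {name: p.match(sample).span() for name, p in PATTERNS.items()}


def time_pattern(pattern, text, number=200):
    """Return the seconds taken by `number` calls of pattern.match(text)."""
    return timeit.timeit(lambda: pattern.match(text), number=number)


for name, pattern in PATTERNS.items():
    print(f"{name:20} span={spans[name]} time={time_pattern(pattern, sample):.6f}s")
```

If the three timings are close on a given build, that interpreter may already include the optimization proposed in the linked PRs. On versions without it, the DOTALL and (?s:.) forms are the safer choice.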
CPython versions tested on:
3.11, 3.13
Operating systems tested on:
Linux, Windows
Linked PRs
- gh-111259: Optimize recursive wildcards in pathlib #111303
- gh-111259: Optimize complementary character sets in RE #120742
- gh-111259: Document idiomatic RE pattern (?s:.) that matches any character #120745
- [3.13] gh-111259: Document idiomatic RE pattern (?s:.) that matches any character (GH-120745) #120813
- [3.12] gh-111259: Document idiomatic RE pattern (?s:.) that matches any character (GH-120745) #120814