Skip to content

proposal: regexp: Optimize fixed-length patterns #21463

Open
@sylvinus

Description

@sylvinus

The regexp package has 3 different matchers (NFA, onepass, backtrack). One of them is selected depending on the pattern.

I suggest adding a 4th matcher optimized for fixed-length patterns like a.ab$ on strings.
I wrote a proof-of-concept implementation which provides constant-time matching relative to the size of the input string in many cases, whereas the current matchers from regexp perform linearly. Performance becomes close to what can be achieved with methods like strings.Index.

Here is a sample benchmark result of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b"):

M=1000
BenchmarkRegexpBypass/DotSuffix/bypass-8   	30000000	    109 ns/op  # New matcher implementation
BenchmarkRegexpBypass/DotSuffix/std-8      	   50000	  69882 ns/op  # Go's current standard library
BenchmarkRegexpBypass/DotSuffix/pcre-8     	  100000	  41359 ns/op  # PCRE
BenchmarkRegexpBypass/DotSuffix/rust-8     	20000000	    157 ns/op  # Rust regexp engine

I've added more details, benchmarks and tests (with fuzzing!) to the repository. Not sure how much should be inlined here.

The obvious cons of this proposal are:

  • Adding more code to the standard library
  • Adding more work at compilation time.

The pros are:

  • Constant-time matching for many simple regexps (relative to the size of the input string)
  • No overhead for regexps that are not supported
  • Go's regexp package is usually considered immature performance-wise. This proposal plays a small role in fixing that by adding optimizations that can reasonably be expected from the end-user.
  • This matcher keeps very little state and bypasses the mutex from regexp.go
  • There are already 3 different matchers in the standard library (4 with the upcoming DFA), so adding a new one for a specific kind of patterns is not surprising.
  • regexp.MatchString("(?:png|jpg)$") could obviously be rewritten as strings.HasSuffix("png") or strings.HasSuffix("jpg") but sometimes it is not practical because the pattern to be matched is user-supplied or part of a long list of patterns. Examples include interactive log search or lists of paths in HTTP routers.
  • Limited risk due to exhaustive tests in the standard library and additional fuzzing

Feedback would be highly appreciated. Thanks!!

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions