Description
The regexp
package has 3 different matchers (NFA, onepass, backtrack). One of them is selected depending on the pattern.
I suggest adding a 4th matcher optimized for fixed-length patterns like a.ab$
on strings.
I wrote a proof-of-concept implementation which provides constant-time matching relative to the size of the input string in many cases, whereas the current matchers from regexp
perform linearly. Performance becomes close to what can be achieved with methods like strings.Index
.
Here is a sample benchmark result of regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b")
:
M=1000
BenchmarkRegexpBypass/DotSuffix/bypass-8 30000000 109 ns/op # New matcher implementation
BenchmarkRegexpBypass/DotSuffix/std-8 50000 69882 ns/op # Go's current standard library
BenchmarkRegexpBypass/DotSuffix/pcre-8 100000 41359 ns/op # PCRE
BenchmarkRegexpBypass/DotSuffix/rust-8 20000000 157 ns/op # Rust regexp engine
I've added more details, benchmarks and tests (with fuzzing!) to the repository. Not sure how much should be inlined here.
The obvious cons of this proposal are:
- Adding more code to the standard library
- Adding more work at compilation time.
The pros are:
- Constant-time matching for many simple regexps (relative to the size of the input string)
- No overhead for regexps that are not supported
- Go's
regexp
package is usually considered immature performance-wise. This proposal plays a small role in fixing that by adding optimizations that can reasonably be expected from the end-user. - This matcher keeps very little state and bypasses the mutex from
regexp.go
- There are already 3 different matchers in the standard library (4 with the upcoming DFA), so adding a new one for a specific kind of patterns is not surprising.
regexp.MatchString("(?:png|jpg)$")
could obviously be rewritten asstrings.HasSuffix("png") or strings.HasSuffix("jpg")
but sometimes it is not practical because the pattern to be matched is user-supplied or part of a long list of patterns. Examples include interactive log search or lists of paths in HTTP routers.- Limited risk due to exhaustive tests in the standard library and additional fuzzing
Feedback would be highly appreciated. Thanks!!