Skip to content

regexp: case-insensitive MatchString performance #13288

Closed
@egonelbre

Description

I was investigating why https://github.com/dimroc/etl-language-comparison/blob/517507b033dbb938dd1e83401914cbd4dd9a79bc/golang/search.go#L160 ends up significantly slower with regular expression.

strings.Contains(strings.ToLower(message), "knicks") // 7.8s
// versus
rx := regexp.MustCompile("(?i)knicks") // won't be done multiple times
rx.MatchString(message) // 21.7s

Profiling the etl example lead me to:

152.84s of 158.97s total (96.14%)
Dropped 167 nodes (cum <= 0.79s)
      flat  flat%   sum%        cum   cum%
    61.12s 38.45% 38.45%     61.25s 38.53%  unicode.SimpleFold
    22.62s 14.23% 52.68%    121.64s 76.52%  regexp.(*machine).tryBacktrack
    18.76s 11.80% 64.48%     18.76s 11.80%  regexp.(*bitState).push
    10.41s  6.55% 71.03%     71.66s 45.08%  regexp/syntax.(*Inst).MatchRunePos
     8.99s  5.66% 76.68%      9.41s  5.92%  regexp.(*inputString).step
     8.94s  5.62% 82.30%    135.55s 85.27%  regexp.(*machine).backtrack
     5.76s  3.62% 85.93%      5.76s  3.62%  runtime.osyield
     3.48s  2.19% 88.12%     75.14s 47.27%  regexp/syntax.(*Inst).MatchRune

Which lead me to: https://github.com/golang/go/blob/master/src/regexp/syntax/prog.go#L213

I assume that there is a good reason for using unicode.SimpleFold, i.e. some special unicode symbols. When I changed it:

for r1 := unicode.SimpleFold(r0); r1 != r0; r1 = unicode.SimpleFold(r1) {
// into
if r1 := unicode.ToLower(r0); r1 == r || r1 == unicode.ToLower(r) {

It didn't break any tests... also I created easy0i = "(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ$" in https://github.com/golang/go/blob/master/src/regexp/exec_test.go#L674 to measure the performance change:

// old
BenchmarkMatchEasy0i_32-8                 500000              2860 ns/op          11.19 MB/s
BenchmarkMatchEasy0i_1K-8                  20000             84204 ns/op          12.16 MB/s
BenchmarkMatchEasy0i_32K-8                   500           3346191 ns/op           9.79 MB/s
BenchmarkMatchEasy0i_1M-8                     10         107206130 ns/op           9.78 MB/s
BenchmarkMatchEasy0i_32M-8                     1        3424195900 ns/op           9.80 MB/s
// new
BenchmarkMatchEasy0i_32-8                1000000              1831 ns/op          17.48 MB/s
BenchmarkMatchEasy0i_1K-8                  30000             56369 ns/op          18.17 MB/s
BenchmarkMatchEasy0i_32K-8                  1000           2276130 ns/op          14.40 MB/s
BenchmarkMatchEasy0i_1M-8                     20          73004175 ns/op          14.36 MB/s
BenchmarkMatchEasy0i_32M-8                     1        2334133500 ns/op          14.38 MB/s

So, is there a reason why unicode.SimpleFold is used instead of something else, such as unicode.ToLower? Either way, there might be some significant performance gains here for case-insensitive matches.

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions