Description
I was investigating why https://github.com/dimroc/etl-language-comparison/blob/517507b033dbb938dd1e83401914cbd4dd9a79bc/golang/search.go#L160 ends up significantly slower with regular expression.
strings.Contains(strings.ToLower(message), "knicks") // 7.8s
// versus
rx := regexp.MustCompile("(?i)knicks") // won't be done multiple times
rx.MatchString(message) // 21.7s
Profiling the etl example lead me to:
152.84s of 158.97s total (96.14%)
Dropped 167 nodes (cum <= 0.79s)
flat flat% sum% cum cum%
61.12s 38.45% 38.45% 61.25s 38.53% unicode.SimpleFold
22.62s 14.23% 52.68% 121.64s 76.52% regexp.(*machine).tryBacktrack
18.76s 11.80% 64.48% 18.76s 11.80% regexp.(*bitState).push
10.41s 6.55% 71.03% 71.66s 45.08% regexp/syntax.(*Inst).MatchRunePos
8.99s 5.66% 76.68% 9.41s 5.92% regexp.(*inputString).step
8.94s 5.62% 82.30% 135.55s 85.27% regexp.(*machine).backtrack
5.76s 3.62% 85.93% 5.76s 3.62% runtime.osyield
3.48s 2.19% 88.12% 75.14s 47.27% regexp/syntax.(*Inst).MatchRune
Which lead me to: https://github.com/golang/go/blob/master/src/regexp/syntax/prog.go#L213
I assume that there is a good reason for using unicode.SimpleFold
, i.e. some special unicode
symbols. When I changed it:
for r1 := unicode.SimpleFold(r0); r1 != r0; r1 = unicode.SimpleFold(r1) {
// into
if r1 := unicode.ToLower(r0); r1 == r || r1 == unicode.ToLower(r) {
It didn't break any tests... also I created easy0i = "(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ$"
in https://github.com/golang/go/blob/master/src/regexp/exec_test.go#L674 to measure the performance change:
// old
BenchmarkMatchEasy0i_32-8 500000 2860 ns/op 11.19 MB/s
BenchmarkMatchEasy0i_1K-8 20000 84204 ns/op 12.16 MB/s
BenchmarkMatchEasy0i_32K-8 500 3346191 ns/op 9.79 MB/s
BenchmarkMatchEasy0i_1M-8 10 107206130 ns/op 9.78 MB/s
BenchmarkMatchEasy0i_32M-8 1 3424195900 ns/op 9.80 MB/s
// new
BenchmarkMatchEasy0i_32-8 1000000 1831 ns/op 17.48 MB/s
BenchmarkMatchEasy0i_1K-8 30000 56369 ns/op 18.17 MB/s
BenchmarkMatchEasy0i_32K-8 1000 2276130 ns/op 14.40 MB/s
BenchmarkMatchEasy0i_1M-8 20 73004175 ns/op 14.36 MB/s
BenchmarkMatchEasy0i_32M-8 1 2334133500 ns/op 14.38 MB/s
So, is there a reason why unicode.SimpleFold
is used instead of something else, such as unicode.ToLower
? Either way, there might be some significant performance gains here for case-insensitive matches.
Activity