-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regexp: POSIX regexp takes 4 seconds to execute #11181
Comments
interesting. playground seems to give up. |
I put a simplified test case at https://play.golang.org/p/Pkm1dxj2Pq The problem is the "[^!]+{999}" You can make the regular expression take an arbitrarily long time by just tacking more copies of that string onto the end of the regexp in the simplified example I gave. The nested repetition currently only causes an regexp parse error when using the PerlX syntax flag. |
Thanks, @tolchz Here is arguably even simpler version, runs for 8 seconds
|
CL https://golang.org/cl/11080 mentions this issue. |
More fun can be had too:
real 0m32.585s |
google/re2@7444e38 is probably RTYI. I could work on porting this to Go if you like. :) |
This is working as intended. The repetition is just that, a repetition. If you make that number bigger, the time takes longer. It's linear in the size of the regexp where the size of a{1000} is 1000. It's also linear in the size of the input: O(nm) where n is text size and m is regexp size. This is more or less fundamental. |
It is not linear. |
It makes sense that runtime is linear in n, but only for a fixed regexp size m. We must also interpret it as "worst case upper bound is linear", and this isn't a promise that average case will grow linearly. I ran a simulation with pattern "|A+{m}", with various values of n from 1 to 1000, and m from 1 to 1000, here the result (high durations in red) : This shows that the worst case is obtained with n = m - 1 We can't write pattern "|A+{m}" with m > 1000, however it must be equivalent (iiuc) to expanded pattern "|(A+A+A+ ... A+)" with m repetitions of A+ . So I ran a second simulation with n = m - 1 , with various values of m from 1 to 2000 :
and plotted the runtime on Y axis : According to the "linear in m and linear in n" promise, we should have worst time proportional to m^2 . Unfortunatly, the results (for this particular pattern and input string) clearly indicate that worst time is proportional to m^3 . |
@Deleplace very interesting, thanks. I read a paper recently (can't share it yet sorry) that might help us implement counted repetition more efficienctly, but even without that I am a bit surprised by the m^3. I'd like to understand that better, but I'm about to go on leave for the summer. I'll flag this for Go 1.8 and hopefully someone will be able to investigate it then. /cc @junyer FYI |
Oddly enough, the |
... and if you place the |
Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again. So you have m^2 for each match attempt, as promised, and then m match attempts as well. Mystery solved. We may still be able to do a little better, but that particular cubic (really quadratic) is unavoidable. |
Just based on your description, if there is no match in the whole string, how can there be a match in a substring? |
Empty string is a valid match, and it is present many times in the result slice : |
@Deleplace I see, thanks. |
We know why this happens and it seems unavoidable. |
The following FindAllString takes 4 seconds to execute. Not sure whether there is a performance problem or not, but the package claims to have linear execution time, so that's 5.5 ms per character which looks like an issue.
go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64
The text was updated successfully, but these errors were encountered: