Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: POSIX regexp takes 4 seconds to execute #11181

Closed
dvyukov opened this issue Jun 12, 2015 · 17 comments
Closed

regexp: POSIX regexp takes 4 seconds to execute #11181

dvyukov opened this issue Jun 12, 2015 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Jun 12, 2015

The following FindAllString takes 4 seconds to execute. Not sure whether there is a performance problem or not, but the package claims to have linear execution time, so that's 5.5 ms per character which looks like an issue.

package main

import "regexp"

func main() {
    re := regexp.MustCompilePOSIX("|{0x6B27143FFb2C6.0xAfbb212adA07dC16E19-442426454885445967343337981592e-0xFDbfC4aD3F17CFC4Bd80x0C9aedB8cFCAb57bBD9AcDdfE10858856}^{-0363642521013670e-795506009086632}b|^050226227030136e1509816632298701486455993285303619190b|([^!]+{953})|^J4__P17Y#^^{0035.-0610260367-128}^f__R_f6_s_v6_8XLItr7_1_B__gxdp6ko81_tXI_4m94u9_q__78PA7_1___8tct__kY8_94|b5165Jp2_3N_xg3_|$\x02Q9m5C023w_Ej0_guW2%8_90a1\x0fAzPq4d_J__Kre4nkw3G\ftwentyeighttwentynineirtyone|99hSe9M7q1M_jbxbwJY4YFxDZ___Q_t2v___w_NeXertyeight|thirtynine|[AZ-]tyfive|D_||ortyeight|_ki_xbuJvO__102___2_KSS8Rkp2C__|fy||6G5a_15Zgx58__56E0__9bBfkKB_37_b9N__IFZ_n3EMk2__p|T__OFK__O2E3>|ht|$|||isie||efoursevsevnt||o|_6U|eix|_4aN964__A34lcQq|eightyeightthirtyeight|thirtynine_L")
    re.FindAllString("S__c2_Oe_5TK2cNDf2aC_F4y_Vwi____ggH_jR_95b_R8vFs8NEBs_5Ikmf5S8Laexw40C[five|X2R8r5__A9_u__c__O__150G_az_MbK__04XGMmF_R_h_FwYp_g5_2___bic|seven|PS1|nine|71U|b1_U0a_t_4VKEqIRq7_2_g_giC1Hj_2__q9__La__7vmz_GnPT_Y5_iw1Wu__k33V_WW_gi__6__D_Vw_2qb_32|__80Tj6_e9F_PL__c1_03_xm2TK_1820bcWKL4B_8_T6|fourteen|fifteen|[^]\x1bseventeen\x13<nineteen|\t|\x05|twentytwo_|\x12twentyfive|\x16\x00]twentyeight|twentynine|FVSv_pp3cE|thirtyone|\x03^|_|thirtyfive|thirtysix|:^xdigitthree|Wc___7_81DL__2JZ4S1u_lY6U9sN_1tw8bI6__k__f1Tn_gsN_EeQ4H2KR_9r_V_N5_t4XuIpdzt_WD3Uvq_X9_rWw_A__X|five|six\x0f83|G_h|nine|ten|eleventwelve|thirteen|-0xaaaaDe20daBF3baB|_miAsixteen|seventeen|eighteen|nineteenZa:3\"}('(|?b\x00\x00ascii[AZ-]D)e_4bVe___53_08__QX[:\x1b__fXM_C0_bU:]|^{32}\x16'{$}", -1)
}

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64

@mattn
Copy link
Member

mattn commented Jun 12, 2015

interesting. playground seems to give up.

http://play.golang.org/p/YIrDvZx0Fk

@ianlancetaylor ianlancetaylor added this to the Go1.6 milestone Jun 12, 2015
@tzneal
Copy link
Member

tzneal commented Jun 13, 2015

I put a simplified test case at https://play.golang.org/p/Pkm1dxj2Pq

The problem is the "[^!]+{999}" You can make the regular expression take an arbitrarily long time by just tacking more copies of that string onto the end of the regexp in the simplified example I gave.

The nested repetition currently only causes an regexp parse error when using the PerlX syntax flag.

https://go-review.googlesource.com/11080/

@dvyukov
Copy link
Member Author

dvyukov commented Jun 13, 2015

Thanks, @tolchz

Here is arguably even simpler version, runs for 8 seconds

package main

import (
    "regexp"
    "strings"
)

func main() {
    re := regexp.MustCompilePOSIX("|A+{1000}")
    re.FindAllString(strings.Repeat("A", 999), -1)
}

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/11080 mentions this issue.

@tzneal
Copy link
Member

tzneal commented Jun 15, 2015

More fun can be had too:

package main
import (
    "regexp"
    "strings"
)

func main() {
    re := regexp.MustCompilePOSIX("|A+{1000}|[^B]+{1000}|[^C]+{1000}|[^D]+{1000}")
    re.FindAllString(strings.Repeat("A", 999), -1)
}

real 0m32.585s
user 0m32.637s
sys 0m0.008s

@junyer
Copy link
Contributor

junyer commented Jul 24, 2015

google/re2@7444e38 is probably RTYI. I could work on porting this to Go if you like. :)

(Ping, @kcc and @rsc.)

@rsc rsc modified the milestones: Unplanned, Go1.6 Oct 14, 2015
@rsc
Copy link
Contributor

rsc commented Oct 14, 2015

This is working as intended. The repetition is just that, a repetition. If you make that number bigger, the time takes longer. It's linear in the size of the regexp where the size of a{1000} is 1000. It's also linear in the size of the input: O(nm) where n is text size and m is regexp size. This is more or less fundamental.

@dvyukov
Copy link
Member Author

dvyukov commented Oct 14, 2015

It is not linear.
"|A+{500}" takes 0.2s
"|A+{1000}" takes 8s

@Deleplace
Copy link
Contributor

It makes sense that runtime is linear in n, but only for a fixed regexp size m.

We must also interpret it as "worst case upper bound is linear", and this isn't a promise that average case will grow linearly.

I ran a simulation with pattern "|A+{m}", with various values of n from 1 to 1000, and m from 1 to 1000, here the result (high durations in red) :

regexp_not_linear_plot

This shows that the worst case is obtained with n = m - 1

We can't write pattern "|A+{m}" with m > 1000, however it must be equivalent (iiuc) to expanded pattern "|(A+A+A+ ... A+)" with m repetitions of A+ .

So I ran a second simulation with n = m - 1 , with various values of m from 1 to 2000 :

re = regexp.MustCompilePOSIX("|(" + strings.Repeat("A+", m) + ")") re.FindAllString(strings.Repeat("A", m-1), -1)

and plotted the runtime on Y axis :

regex_time_curves

According to the "linear in m and linear in n" promise, we should have worst time proportional to m^2 .

Unfortunatly, the results (for this particular pattern and input string) clearly indicate that worst time is proportional to m^3 .

@rsc rsc modified the milestones: Go1.8, Unplanned May 23, 2016
@rsc
Copy link
Contributor

rsc commented May 23, 2016

@Deleplace very interesting, thanks. I read a paper recently (can't share it yet sorry) that might help us implement counted repetition more efficienctly, but even without that I am a bit surprised by the m^3. I'd like to understand that better, but I'm about to go on leave for the summer. I'll flag this for Go 1.8 and hopefully someone will be able to investigate it then.

/cc @junyer FYI

@junyer
Copy link
Contributor

junyer commented May 23, 2016

Oddly enough, the | seems to be causing the problems here. If you remove it, POSIX mode matches just as quickly as Perl mode.

@junyer
Copy link
Contributor

junyer commented May 23, 2016

... and if you place the | at the end instead, Perl mode matches just as slowly as POSIX mode. :)

@rsc
Copy link
Contributor

rsc commented May 23, 2016

Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again. So you have m^2 for each match attempt, as promised, and then m match attempts as well. Mystery solved. We may still be able to do a little better, but that particular cubic (really quadratic) is unavoidable.

@dvyukov
Copy link
Member Author

dvyukov commented May 29, 2016

Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again.

Just based on your description, if there is no match in the whole string, how can there be a match in a substring?

@Deleplace
Copy link
Contributor

Empty string is a valid match, and it is present many times in the result slice :
https://play.golang.org/p/mJN6sJWJ68

@dvyukov
Copy link
Member Author

dvyukov commented May 29, 2016

@Deleplace I see, thanks.

@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 10, 2016
@rsc
Copy link
Contributor

rsc commented Oct 18, 2016

We know why this happens and it seems unavoidable.

@rsc rsc closed this as completed Oct 18, 2016
@golang golang locked and limited conversation to collaborators Oct 18, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants