Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing * * * * * * … takes quadratic time #2

Closed
andersk opened this issue Mar 10, 2019 · 7 comments
Closed

Parsing * * * * * * … takes quadratic time #2

andersk opened this issue Mar 10, 2019 · 7 comments

Comments

@andersk
Copy link

andersk commented Mar 10, 2019

$ python3 -c 'print(end="* "*1000)' | time commonmark > /dev/null
0.46user 0.16system 0:00.35elapsed 177%CPU (0avgtext+0avgdata 52080maxresident)k
0inputs+0outputs (0major+15662minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*2000)' | time commonmark > /dev/null
1.45user 0.63system 0:01.07elapsed 193%CPU (0avgtext+0avgdata 52120maxresident)k
0inputs+0outputs (0major+16320minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*4000)' | time commonmark > /dev/null
6.27user 2.81system 0:04.52elapsed 201%CPU (0avgtext+0avgdata 52292maxresident)k
0inputs+0outputs (0major+18973minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*8000)' | time commonmark > /dev/null
35.88user 15.90system 0:25.93elapsed 199%CPU (0avgtext+0avgdata 51780maxresident)k
0inputs+0outputs (0major+25323minor)pagefaults 0swaps
@jgm
Copy link
Owner

jgm commented Mar 11, 2019

Note that the newline is suppressed at the end of line, so this gets parsed as a very deeply nested list item. One could also test with something like

python3 -c 'print("* "*8000 + "a")'

@jgm
Copy link
Owner

jgm commented Mar 11, 2019

Timings for cmark:

  • 4000 0.19
  • 8000 0.70
  • 12000 2.16
  • 16000 4.27

So cmark also shows quadratic behavior for this; the same issue could be reported there.

@jgm
Copy link
Owner

jgm commented Mar 11, 2019

There doesn't seem to be any backtracking in the parsing algorithm.
It could have to do with adding things to deeply nested structures.

@jgm
Copy link
Owner

jgm commented Mar 11, 2019

I suspect it has to do with the finalize step (I tried commenting out list finalizing in cmark, and it sped things up).

@jgm
Copy link
Owner

jgm commented Mar 18, 2019

There are two problems.

Problem 1 is the use of getBlanks in blockFinalize for listSpec in Blocks.hs.
This recurses through all the children at each level, leading to quadratic performance.
Unfortunately we can't just port over the change to cmark, since they rely on mutable block structure.

Problem 2 is that we repeatedly scan for thematic breaks, even after it should be clear that this is unnecessary (since on the first scan, we see that the a at the end of the line is incompatible with any thematic break). In cmark we fixed this by using a custom scanner that sets a "thematic break kill position" on the first failure; we then don't need to scan for thematic breaks again until we pass this kill position. Again, the specific change will have to be modified for pure code.

@jgm
Copy link
Owner

jgm commented Mar 18, 2019

We're finalizing the most deeply nested list first, then on up the line.
So it should be sufficient, for Problem 1, to store something in the list data for the finalized list indicating whether it ends in a blank line. (Indeed, this would be a cleaner solution for cmark, too.)

jgm added a commit that referenced this issue Mar 23, 2019
@jgm
Copy link
Owner

jgm commented Mar 24, 2019

Problem 1 solved. Problem 2 remains.

jgm added a commit that referenced this issue Mar 24, 2019
jgm added a commit that referenced this issue Mar 25, 2019
This is to store positions in a line where we know a scanner
will fail. The immediate application is to #2.
@jgm jgm closed this as completed in 53dd261 Mar 25, 2019
jgm added a commit that referenced this issue Mar 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants