-
-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing ‘* * * * * * … a’ takes quadratic time #284
Comments
Got the culprit in MD4C. I bet it will be the same in Cmark: The line is scanned as thematic break until the final |
@mity good tip. Here's the difference it makes if we remove the check for thematic breaks entirely:
|
The question is how to fix this. (@mity, what did you do?)
I suppose that, before parsing any list items, one could parse from the end of the line to determine if it might end with a thematic break, and store that information in state. That would help with cases that end in |
I fixed by remembering the offset of character ( |
Note
|
Keep track of the last position where a thematic break failed to match on a line, to avoid rescanning unnecessarily. See #284.
@mity I tried to implement your strategy in the issue284 branch. Is it possible that something else is to blame, perhaps in the HTML writer for nested lists? |
I'm seeing only a slight performance difference in this case with the custom thematic break scanner (< 10%, for a pathological input that should emphasize the difference). |
As noted in the linked issue, I suspect the culprit is actually |
Changing |
The loop in |
We should be able to memoize this: once a list item has been checked to see if it ends in a blank line, we shouldn't need to recurse into its subitems again. This should simply be recorded in the node. |
to avoid unnecessary repetition. Once we settle whether a list item ends in a blank line, we don't need to revisit this in considering parent list items. See #284.
Hm, that was a good theory, but it doesn't seem to be the issue here: we still have quadratic behavior even after fixing |
I do not know enough about cmark internals to comment. But I have just tried this in the current master/HEAD: $ echo ' * * * - - -' | ./src/cmark
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul> While in MD4C: $ echo ' * * * - - -' | md2html
<ul>
<li><ul>
<li><ul>
<li><hr>
</li>
</ul>
</li>
</ul>
</li>
</ul> Which means cmark failed to recognize there is thematic break nested in the list. |
So likely there is more then one bug. |
So as I see it:
|
Yes that point 3 is I guess, it is to detect loose versus tight list, right? In MD4C this is done on the fly during line analysis. I maintain a stack of currently "opened" container blocks. Every list is initially marked as tight one. When I encounter a blank line, I simply consult the stack and if its top is a list, then I set the flag making it a loose one. EDIT: It is little bit more complicated: Blank line sets a helper flag. Non-blank line which still belongs to the same list does the check whether the flag is set. This distinguishes blank lines inside/between list items versus after the list. See https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5917 |
I found the issue I was chasing earlier. Fix soon. But, about your case: why isn't cmark's output correct? |
Consistency: The priority of the interpretation should be the same inside the list as on the top level. So as long as this
is thematic break, it should be thematic break when inside the list. |
Keep track of the last position where a thematic break failed to match on a line, to avoid rescanning unnecessarily. See #284.
OK, agreed. The code I just pushed solves this problem -- we recognize the inner hrule. It also solves the performance problem, through a combination of smarter 'blank-in-list' detection and a hand-rolled thematic break scanner with your thematic break kill position idea. Parsing time is now linear with input size for the |
And moral conclusion for me: Do not make bets without looking into source files... |
Your bet was right, though: It's just that it was only one of two problems. |
This fixes a case like commonmark/cmark#284.
to avoid unnecessary repetition. Once we settle whether a list item ends in a blank line, we don't need to revisit this in considering parent list items. See commonmark#284.
Use this to avoid unnecessary recursion in ends_with_blank_line. Closes #284.
Keep track of the last position where a thematic break failed to match on a line, to avoid rescanning unnecessarily. See commonmark#284.
This flag was introduced by commonmark#284, but we will not need it once we update `S_ends_with_blank_line` to not use resursion in the next commit.
This flag was introduced by commonmark#284, but we will not need it once we update `S_ends_with_blank_line` to not use resursion in the next commit.
Related: jgm/commonmark-hs#2, mity/md4c#66.
The text was updated successfully, but these errors were encountered: