Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize header parsing #513

Closed
wants to merge 3 commits into from
Closed

Conversation

glebm
Copy link
Contributor

@glebm glebm commented May 18, 2018

Fixes #505

Benchmarks:

"setext":

ruby -rbenchmark -Ilib -rkramdown -e 'p Benchmark.measure{Kramdown::Document.new("1#{" "*20000}2\n==\n")}'

"atx":

ruby -rbenchmark -Ilib -rkramdown -e 'p Benchmark.measure{Kramdown::Document.new("## 1#{" "*20000}2")}'

@glebm glebm force-pushed the header-parsing branch 2 times, most recently from bc1c4f8 to 5fa05e5 Compare May 18, 2018 02:29
Fixes gettalong#505

Benchmarks:

"setext":

```bash
ruby -rbenchmark -Ilib -rkramdown -e 'p Benchmark.measure{Kramdown::Document.new("1#{" "*20000}2\n==\n")}'
```

"atx":

```bash
ruby -rbenchmark -Ilib -rkramdown -e 'p Benchmark.measure{Kramdown::Document.new("## 1#{" "*20000}2")}'
```
@glebm glebm force-pushed the header-parsing branch from 5fa05e5 to 9b35ff2 Compare May 18, 2018 12:33
Copy link

@krasnoukhov krasnoukhov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this fixes a performance issue with no breaking changes. Good work!

Copy link
Owner

@gettalong gettalong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request and your great work!

Before I merge, please also remove the changes to .gitignore and remove the zero-byte test/testcases/block/04_header/atx_header.hcd file.

@src.check(ATX_HEADER_MATCH)
level, text, id = @src[1], @src[2].to_s.strip, @src[3]
text, id = parse_header_contents
text.sub!(/[\t ]#+\z/, '') && text.rstrip!
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why these two statements are on one line? I would rather prefer them on separate lines.

Copy link
Contributor Author

@glebm glebm May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to call rstrip! if sub! return nil.

The other options here are:

A:

text.rstrip! if text.sub!(...)

B:

if text.sub!(...)
  text.rstrip!
end

I find the current one (&&) to be the most readable but can go with A or B if that's what you prefer.

Copy link
Owner

@gettalong gettalong May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would just put them on separate lines and take this minimal performance hit because it would be easier to see what's going on without the additional conditional. But I'm okay with leaving this as it is.

@@ -13,47 +13,60 @@ module Kramdown
module Parser
class Kramdown

HEADER_ID=/(?:[ \t]+\{#([A-Za-z][\w:-]*)\})?/
SETEXT_HEADER_START = /^(#{OPT_SPACE}[^ \t].*?)#{HEADER_ID}[ \t]*?\n(-|=)+\s*?\n/
SETEXT_HEADER_START = /^#{OPT_SPACE}(?<contents>.*)\n(?<level>[-=])[-=]*[ \t\r\f\v]*\n/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you leave out [^ \t] in the new version? It is not strictly necessary for the main kramdown parser due to the fixed order of block parsers but behaviour of derived parser may change. As far as I can tell, it doesn't make a performance difference.

Also: What is the reason behind changing \s to [ \t\r\f\v]?

Copy link
Contributor Author

@glebm glebm May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you leave out [^ \t] in the new version?

Because it's not necessary as we have to validate header length later on anyway. I have now added it back in. No performance difference.

Also: What is the reason behind changing \s to [ \t\r\f\v]?

I find constructs like \s*?\n a bit cryptic. Performance-wise both are the same.

By the way, do we actually want to accept \r\f\v here? CommonMark spec only mentions trailing spaces (https://spec.commonmark.org/0.28/#example-54), and also allows OPT_SPACE for the underline.


HEADER_ID = /[\t ]{#(?<id>[A-Za-z][\w:-]*)}\z/

# @return [[String, String]] header text and optional ID.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use standard RDoc comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with RDoc, and although I've read the RDoc Markup Reference, it doesn't seem to mention the correct markup for type hints, so I'm not sure what to do here.

The format I've used is YARD, which is also widely supported by IDEs for type hints. Can you convert to the desired format when merging?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RDoc doesn't have type hints.

Just leave out this API documentation (most of the other parsing methods don't have API documentation as well).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


# @param [Number] level
# @param [String] text
# @param [String, nil] id
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use standard RDoc comments.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, just remove the API documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gettalong
Copy link
Owner

I just noticed that the PR doesn't include the test cases themselves. Could you add them or should I add them after I merge the changes?

And please squash the commits into one before I merge them - thank you!

@glebm
Copy link
Contributor Author

glebm commented May 26, 2018

I just noticed that the PR doesn't include the test cases themselves. Could you add them or should I add them after I merge the changes?

Please add them yourself, it will be faster this way.

And please squash the commits into one before I merge them - thank you!

Nowadays you can do it yourself at merge time by clicking "Squash and merge"!

@gettalong
Copy link
Owner

Merged - thanks!

@gettalong gettalong closed this May 26, 2018
@krasnoukhov
Copy link

@gettalong Will you cut a release please? Thank you

@gettalong
Copy link
Owner

Yes, if I have time this weekend.

@glebm glebm deleted the header-parsing branch May 31, 2018 01:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants