Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let Nokogiri parse only the necessary markup #74

Merged
merged 3 commits into from
May 2, 2018

Conversation

ashmaroli
Copy link
Member

Instead of having Nokogiri parse the entire HTML doc and then selectively apply the emojify_filter to just the contents of the <body> tag, let's have Nokogiri parse just the <body> tag descendants instead

The emojified markup is then inserted back into doc.output via String#<<

@DirtyF DirtyF requested a review from a team April 4, 2018 19:05
@DirtyF
Copy link
Member

DirtyF commented Apr 4, 2018

What difference does this make?

@ashmaroli
Copy link
Member Author

What this does is essentially telling Nokogiri that it need not parse the entire output but just the content between <body> and </body> (the content will be extracted by the plugin).

This results in lower build times since Nokogiri has to do less and can be more focused from the onset instead of having to read-in the bigger picture and then narrow the focus down to the <body> element.
The added advantage being Nokogiri will now not insert extra markup in the <head> of the HTML file.

@ashmaroli
Copy link
Member Author

similar approach had been proposed for jekyll-mentions as well..

Copy link
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! One question.

@@ -1,7 +1,6 @@
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin(s) will no longer add this extra markup to our output..
The passing script/cibuild testifies this..

lib/jemoji.rb Outdated
@@ -9,6 +9,7 @@ class Emoji
GITHUB_DOT_COM_ASSET_HOST_URL = "https://assets-cdn.github.com".freeze
ASSET_PATH = "/images/icons/".freeze
BODY_START_TAG = "<body".freeze
OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Does the .* need to be non-greedy? (not sure which is more fault tolerant)
  2. Does this need to be case insensitive to allow for <BODY tags?
  3. Could we simplify/collapse the existing logic with BODY_START_TAG?

Copy link
Member Author

@ashmaroli ashmaroli Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting questions @benbalter. I ran a benchmark script to reach a conclusion:
(Results from Ruby 2.3 on Windows)

require 'benchmark/ips'

BODY_START_TAG = "<body".freeze
LAZY_MATCH   = %r!#{BODY_START_TAG}(.*?)>\s*!
GREEDY_MATCH = %r!#{BODY_START_TAG}(.*)>\s*!
LAZY_MATCH_MIXCASE   = %r!#{BODY_START_TAG}(.*?)>\s*!i
GREEDY_MATCH_MIXCASE = %r!#{BODY_START_TAG}(.*)>\s*!i

# CONTENT = unmodified HTML output from one of the posts at jekyll/jekyll
# redacted for obvious reasons..

def lazy
  CONTENT.partition(LAZY_MATCH)
end

def greedy
  CONTENT.partition(GREEDY_MATCH)
end

def mixcase_lazy
  CONTENT.partition(LAZY_MATCH_MIXCASE)
end

def mixcase_greedy
  CONTENT.partition(GREEDY_MATCH_MIXCASE)
end

p lazy == greedy
p mixcase_lazy == mixcase_greedy
p lazy == mixcase_greedy

Benchmark.ips do |x|
  x.report('lazy') { lazy }
  x.report('greedy') { greedy }
  x.report('lazy-mixcase') { mixcase_lazy }
  x.report('greedy-mixcase') { mixcase_greedy }
  x.compare!
end
true
true
true
Warming up --------------------------------------
                lazy     5.655k i/100ms
              greedy     5.772k i/100ms
        lazy-mixcase     3.291k i/100ms
      greedy-mixcase     3.443k i/100ms
Calculating -------------------------------------
                lazy     63.678k (A± 0.9%) i/s -    322.335k in   5.062399s
              greedy     65.333k (A± 1.1%) i/s -    329.004k in   5.036390s
        lazy-mixcase     35.615k (A± 1.1%) i/s -    181.005k in   5.082811s
      greedy-mixcase     37.295k (A± 1.0%) i/s -    189.365k in   5.078017s

Comparison:
              greedy:    65333.3 i/s
                lazy:    63678.0 i/s - 1.03x  slower
      greedy-mixcase:    37295.1 i/s - 1.75x  slower
        lazy-mixcase:    35615.2 i/s - 1.83x  slower

Therefore,

  • I'm 👎 on making (.*) lazy
  • I'm unsure of modifying BODY_START_TAG (a String) into a Regexp
  • I'm 👎 on case-insensitivity because it is slower and because the output is only modified if doc.output.include? "<body" the behavior for a couple of years now..

@DirtyF DirtyF requested a review from a team May 1, 2018 11:11
@DirtyF
Copy link
Member

DirtyF commented May 2, 2018

@jekyllbot: merge +minor

@jekyllbot jekyllbot merged commit f5e2e08 into jekyll:master May 2, 2018
jekyllbot added a commit that referenced this pull request May 2, 2018
@ashmaroli ashmaroli deleted the parse-html-fragment branch May 2, 2018 11:02
@jekyll jekyll locked and limited conversation to collaborators May 2, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants