Let Nokogiri parse only the necessary markup #74

ashmaroli · 2018-03-21T17:21:09Z

Instead of having Nokogiri parse the entire HTML doc and then selectively apply the emojify_filter to just the contents of the <body> tag, let's have Nokogiri parse just the <body> tag descendants instead

The emojified markup is then inserted back into doc.output via String#<<

DirtyF · 2018-04-04T19:06:51Z

What difference does this make?

ashmaroli · 2018-04-05T02:38:25Z

What this does is essentially telling Nokogiri that it need not parse the entire output but just the content between <body> and </body> (the content will be extracted by the plugin).

This results in lower build times since Nokogiri has to do less and can be more focused from the onset instead of having to read-in the bigger picture and then narrow the focus down to the <body> element.
The added advantage being Nokogiri will now not insert extra markup in the <head> of the HTML file.

ashmaroli · 2018-04-05T02:40:52Z

similar approach had been proposed for jekyll-mentions as well..

parkr

Looking good! One question.

parkr · 2018-04-25T15:07:02Z

spec/fixtures/index.html

@@ -1,7 +1,6 @@
 <!DOCTYPE HTML>
 <html lang="en-US">
  <head>
-<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">


Why did you remove this?

The plugin(s) will no longer add this extra markup to our output..
The passing script/cibuild testifies this..

benbalter · 2018-04-25T15:24:34Z

lib/jemoji.rb

@@ -9,6 +9,7 @@ class Emoji
    GITHUB_DOT_COM_ASSET_HOST_URL = "https://assets-cdn.github.com".freeze
    ASSET_PATH = "/images/icons/".freeze
    BODY_START_TAG = "<body".freeze
+    OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!


Does the .* need to be non-greedy? (not sure which is more fault tolerant)

Does this need to be case insensitive to allow for <BODY tags?

Could we simplify/collapse the existing logic with BODY_START_TAG?

Interesting questions @benbalter. I ran a benchmark script to reach a conclusion:
(Results from Ruby 2.3 on Windows)

require 'benchmark/ips' BODY_START_TAG = "<body".freeze LAZY_MATCH = %r!#{BODY_START_TAG}(.*?)>\s*! GREEDY_MATCH = %r!#{BODY_START_TAG}(.*)>\s*! LAZY_MATCH_MIXCASE = %r!#{BODY_START_TAG}(.*?)>\s*!i GREEDY_MATCH_MIXCASE = %r!#{BODY_START_TAG}(.*)>\s*!i # CONTENT = unmodified HTML output from one of the posts at jekyll/jekyll # redacted for obvious reasons.. def lazy CONTENT.partition(LAZY_MATCH) end def greedy CONTENT.partition(GREEDY_MATCH) end def mixcase_lazy CONTENT.partition(LAZY_MATCH_MIXCASE) end def mixcase_greedy CONTENT.partition(GREEDY_MATCH_MIXCASE) end p lazy == greedy p mixcase_lazy == mixcase_greedy p lazy == mixcase_greedy Benchmark.ips do |x| x.report('lazy') { lazy } x.report('greedy') { greedy } x.report('lazy-mixcase') { mixcase_lazy } x.report('greedy-mixcase') { mixcase_greedy } x.compare! end

true true true Warming up -------------------------------------- lazy 5.655k i/100ms greedy 5.772k i/100ms lazy-mixcase 3.291k i/100ms greedy-mixcase 3.443k i/100ms Calculating ------------------------------------- lazy 63.678k (A± 0.9%) i/s - 322.335k in 5.062399s greedy 65.333k (A± 1.1%) i/s - 329.004k in 5.036390s lazy-mixcase 35.615k (A± 1.1%) i/s - 181.005k in 5.082811s greedy-mixcase 37.295k (A± 1.0%) i/s - 189.365k in 5.078017s Comparison: greedy: 65333.3 i/s lazy: 63678.0 i/s - 1.03x slower greedy-mixcase: 37295.1 i/s - 1.75x slower lazy-mixcase: 35615.2 i/s - 1.83x slower

Therefore,

I'm 👎 on making (.*) lazy

I'm unsure of modifying BODY_START_TAG (a String) into a Regexp

I'm 👎 on case-insensitivity because it is slower and because the output is only modified if doc.output.include? "<body" the behavior for a couple of years now..

DirtyF · 2018-05-02T11:00:30Z

@jekyllbot: merge +minor

let Nokogiri parse only the necessary markup

a24f687

DirtyF requested a review from a team April 4, 2018 19:05

parkr reviewed Apr 25, 2018

View reviewed changes

parkr approved these changes Apr 25, 2018

View reviewed changes

benbalter reviewed Apr 25, 2018

View reviewed changes

benbalter mentioned this pull request Apr 25, 2018

Parse only content necessary to mentionify doc jekyll/jekyll-mentions#59

Merged

ashmaroli force-pushed the parse-html-fragment branch 2 times, most recently from ba05dfd to a24f687 Compare April 25, 2018 19:11

Merge branch 'master' into parse-html-fragment

8c9d93a

DirtyF requested a review from a team May 1, 2018 11:11

Merge branch 'master' into parse-html-fragment

29ecc1b

jekyllbot merged commit f5e2e08 into jekyll:master May 2, 2018

jekyllbot added the enhancement label May 2, 2018

jekyllbot added a commit that referenced this pull request May 2, 2018

Update history to reflect merge of #74 [ci skip]

b7cf2bc

ashmaroli deleted the parse-html-fragment branch May 2, 2018 11:02

parkr mentioned this pull request Aug 3, 2018

update from 0.9.0 to 0.10.0 breaks emojis #80

Closed

ashmaroli mentioned this pull request Aug 3, 2018

Do not match the body tag eagerly #82

Merged

jekyll locked and limited conversation to collaborators May 2, 2019

jekyllbot added the frozen-due-to-age label May 2, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Let Nokogiri parse only the necessary markup #74

Let Nokogiri parse only the necessary markup #74

ashmaroli commented Mar 21, 2018

DirtyF commented Apr 4, 2018

ashmaroli commented Apr 5, 2018

ashmaroli commented Apr 5, 2018

parkr left a comment

parkr Apr 25, 2018

ashmaroli Apr 25, 2018

benbalter Apr 25, 2018

ashmaroli Apr 25, 2018 •

edited

Loading

DirtyF commented May 2, 2018

Let Nokogiri parse only the necessary markup #74

Let Nokogiri parse only the necessary markup #74

Conversation

ashmaroli commented Mar 21, 2018

DirtyF commented Apr 4, 2018

ashmaroli commented Apr 5, 2018

ashmaroli commented Apr 5, 2018

parkr left a comment

Choose a reason for hiding this comment

parkr Apr 25, 2018

Choose a reason for hiding this comment

ashmaroli Apr 25, 2018

Choose a reason for hiding this comment

benbalter Apr 25, 2018

Choose a reason for hiding this comment

ashmaroli Apr 25, 2018 • edited Loading

Choose a reason for hiding this comment

DirtyF commented May 2, 2018

ashmaroli Apr 25, 2018 •

edited

Loading