-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Let Nokogiri parse only the necessary markup #74
Conversation
What difference does this make? |
What this does is essentially telling Nokogiri that it need not parse the entire output but just the content between This results in lower build times since Nokogiri has to do less and can be more focused from the onset instead of having to read-in the bigger picture and then narrow the focus down to the |
similar approach had been proposed for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! One question.
@@ -1,7 +1,6 @@ | |||
<!DOCTYPE HTML> | |||
<html lang="en-US"> | |||
<head> | |||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plugin(s) will no longer add this extra markup to our output..
The passing script/cibuild
testifies this..
lib/jemoji.rb
Outdated
@@ -9,6 +9,7 @@ class Emoji | |||
GITHUB_DOT_COM_ASSET_HOST_URL = "https://assets-cdn.github.com".freeze | |||
ASSET_PATH = "/images/icons/".freeze | |||
BODY_START_TAG = "<body".freeze | |||
OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Does the
.*
need to be non-greedy? (not sure which is more fault tolerant) - Does this need to be case insensitive to allow for
<BODY
tags? - Could we simplify/collapse the existing logic with
BODY_START_TAG
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting questions @benbalter. I ran a benchmark script to reach a conclusion:
(Results from Ruby 2.3 on Windows)
require 'benchmark/ips'
BODY_START_TAG = "<body".freeze
LAZY_MATCH = %r!#{BODY_START_TAG}(.*?)>\s*!
GREEDY_MATCH = %r!#{BODY_START_TAG}(.*)>\s*!
LAZY_MATCH_MIXCASE = %r!#{BODY_START_TAG}(.*?)>\s*!i
GREEDY_MATCH_MIXCASE = %r!#{BODY_START_TAG}(.*)>\s*!i
# CONTENT = unmodified HTML output from one of the posts at jekyll/jekyll
# redacted for obvious reasons..
def lazy
CONTENT.partition(LAZY_MATCH)
end
def greedy
CONTENT.partition(GREEDY_MATCH)
end
def mixcase_lazy
CONTENT.partition(LAZY_MATCH_MIXCASE)
end
def mixcase_greedy
CONTENT.partition(GREEDY_MATCH_MIXCASE)
end
p lazy == greedy
p mixcase_lazy == mixcase_greedy
p lazy == mixcase_greedy
Benchmark.ips do |x|
x.report('lazy') { lazy }
x.report('greedy') { greedy }
x.report('lazy-mixcase') { mixcase_lazy }
x.report('greedy-mixcase') { mixcase_greedy }
x.compare!
end
true
true
true
Warming up --------------------------------------
lazy 5.655k i/100ms
greedy 5.772k i/100ms
lazy-mixcase 3.291k i/100ms
greedy-mixcase 3.443k i/100ms
Calculating -------------------------------------
lazy 63.678k (A± 0.9%) i/s - 322.335k in 5.062399s
greedy 65.333k (A± 1.1%) i/s - 329.004k in 5.036390s
lazy-mixcase 35.615k (A± 1.1%) i/s - 181.005k in 5.082811s
greedy-mixcase 37.295k (A± 1.0%) i/s - 189.365k in 5.078017s
Comparison:
greedy: 65333.3 i/s
lazy: 63678.0 i/s - 1.03x slower
greedy-mixcase: 37295.1 i/s - 1.75x slower
lazy-mixcase: 35615.2 i/s - 1.83x slower
Therefore,
- I'm 👎 on making
(.*)
lazy - I'm unsure of modifying
BODY_START_TAG
(aString
) into aRegexp
- I'm 👎 on case-insensitivity because it is slower and because the
output
is only modifiedif doc.output.include? "<body"
the behavior for a couple of years now..
ba05dfd
to
a24f687
Compare
@jekyllbot: merge +minor |
Instead of having Nokogiri parse the entire HTML doc and then selectively apply the emojify_filter to just the contents of the
<body>
tag, let's have Nokogiri parse just the<body>
tag descendants insteadThe emojified markup is then inserted back into
doc.output
viaString#<<