Process HTML using Nokolexbor instead of Nokogumbo #3043
Open
I ran the benchmark from #2722 on Ruby 3.2.2 and 2.7.2, with Nokolexbor added as a third contender.
In these runs, Nokolexbor parses roughly 2-12 times faster and serializes roughly 3-7 times faster than Nokogiri's Gumbo (HTML5) and libxml2 (HTML4) backends.
#! /usr/bin/env ruby
# coding: utf-8
require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "nokolexbor"
  gem "benchmark-ips"
end

require "nokogiri"
require "nokolexbor"
require "benchmark/ips"

filenames = [
  "test/files/GH_1042.html", # 650b
  "test/files/tlm.html", # 70kb
  "big_shopping.html", # 1.9mb
]
inputs = filenames.map { |fn| File.read(fn) }

puts RUBY_DESCRIPTION

inputs.each do |input|
  len = input.length
  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10
    x.report("html5 parse #{len}") do
      Nokogiri::HTML5::Document.parse(input)
    end
    x.report("html4 parse #{len}") do
      Nokogiri::HTML4::Document.parse(input)
    end
    x.report("nokolexbor html5 parse #{len}") do
      Nokolexbor::HTML(input)
    end
    x.compare!
  end
end

puts "=========="

inputs.each do |input|
  len = input.length
  html4_doc = Nokogiri::HTML4::Document.parse(input)
  html5_doc = Nokogiri::HTML5::Document.parse(input)
  html5_doc_nokolexbor = Nokolexbor::HTML(input)
  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10
    x.report("html5 serlz #{len}") do
      html5_doc.to_html
    end
    x.report("html4 serlz #{len}") do
      html4_doc.to_html
    end
    x.report("html5 nokolexbor serlz #{len}") do
      html5_doc_nokolexbor.to_html
    end
    x.compare!
  end
end
ruby 2.7.2 benchmark
$ ruby bench.rb
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Calculating -------------------------------------
html5 parse 656 21.049k (±23.5%) i/s - 179.547k in 9.929976s
html4 parse 656 22.142k (±22.3%) i/s - 189.923k in 9.926466s
nokolexbor html5 parse 656
43.945k (±21.3%) i/s - 296.049k in 9.900173s
Comparison:
nokolexbor html5 parse 656: 43944.8 i/s
html4 parse 656: 22141.7 i/s - 1.98x (± 0.00) slower
html5 parse 656: 21048.8 i/s - 2.09x (± 0.00) slower
Calculating -------------------------------------
html5 parse 70095 300.102 (±18.7%) i/s - 2.684k in 9.997238s
html4 parse 70095 450.409 (±22.6%) i/s - 3.978k in 9.997504s
nokolexbor html5 parse 70095
1.406k (±20.4%) i/s - 13.083k in 9.984839s
Comparison:
nokolexbor html5 parse 70095: 1405.6 i/s
html4 parse 70095: 450.4 i/s - 3.12x (± 0.00) slower
html5 parse 70095: 300.1 i/s - 4.68x (± 0.00) slower
Calculating -------------------------------------
html5 parse 1929522 13.132 (± 7.6%) i/s - 131.000 in 10.075865s
html4 parse 1929522 37.880 (±13.2%) i/s - 370.000 in 10.017928s
nokolexbor html5 parse 1929522
157.773 (± 9.5%) i/s - 1.561k in 9.999853s
Comparison:
nokolexbor html5 parse 1929522: 157.8 i/s
html4 parse 1929522: 37.9 i/s - 4.17x (± 0.00) slower
html5 parse 1929522: 13.1 i/s - 12.01x (± 0.00) slower
==========
Calculating -------------------------------------
html5 serlz 656 40.303k (±17.2%) i/s - 373.898k in 9.891472s
html4 serlz 656 53.260k (±18.3%) i/s - 484.973k in 9.844606s
html5 nokolexbor serlz 656
263.888k (±15.5%) i/s - 2.270M in 9.493963s
Comparison:
html5 nokolexbor serlz 656: 263887.5 i/s
html4 serlz 656: 53260.0 i/s - 4.95x (± 0.00) slower
html5 serlz 656: 40303.4 i/s - 6.55x (± 0.00) slower
Calculating -------------------------------------
html5 serlz 70095 918.855 (±15.1%) i/s - 8.842k in 9.993063s
html4 serlz 70095 1.112k (±13.3%) i/s - 10.828k in 9.992264s
html5 nokolexbor serlz 70095
3.359k (±14.9%) i/s - 32.417k in 9.985435s
Comparison:
html5 nokolexbor serlz 70095: 3358.8 i/s
html4 serlz 70095: 1112.0 i/s - 3.02x (± 0.00) slower
html5 serlz 70095: 918.9 i/s - 3.66x (± 0.00) slower
Calculating -------------------------------------
html5 serlz 1929522 107.234 (±12.1%) i/s - 1.055k in 10.007869s
html4 serlz 1929522 115.701 (±11.2%) i/s - 1.140k in 9.999178s
html5 nokolexbor serlz 1929522
425.103 (±19.8%) i/s - 4.042k in 9.994780s
Comparison:
html5 nokolexbor serlz 1929522: 425.1 i/s
html4 serlz 1929522: 115.7 i/s - 3.67x (± 0.00) slower
html5 serlz 1929522: 107.2 i/s - 3.96x (± 0.00) slower
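For anyone who wants to double-check the speedup ranges, here is a minimal sketch that recomputes them from the Ruby 2.7.2 "Comparison" figures above. The constant names and the `speedup_range` helper are made up for illustration and are not part of the benchmark script. Because the inputs are the rounded i/s values, the ratios can differ slightly in the last digit from what benchmark-ips prints.

```ruby
# Iterations per second copied from the Ruby 2.7.2 "Comparison" blocks above.
# Keys are input sizes in characters; this layout is only for illustration.
PARSE_272 = {
  656       => { nokolexbor: 43944.8, html4: 22141.7, html5: 21048.8 },
  70_095    => { nokolexbor: 1405.6,  html4: 450.4,   html5: 300.1 },
  1_929_522 => { nokolexbor: 157.8,   html4: 37.9,    html5: 13.1 },
}.freeze

SERIALIZE_272 = {
  656       => { nokolexbor: 263887.5, html4: 53260.0, html5: 40303.4 },
  70_095    => { nokolexbor: 3358.8,   html4: 1112.0,  html5: 918.9 },
  1_929_522 => { nokolexbor: 425.1,    html4: 115.7,   html5: 107.2 },
}.freeze

# Returns [min, max] of how many times faster Nokolexbor is than
# either Nokogiri backend across all input sizes (hypothetical helper).
def speedup_range(results)
  ratios = results.values.flat_map do |ips|
    [ips[:nokolexbor] / ips[:html4], ips[:nokolexbor] / ips[:html5]]
  end
  ratios.minmax.map { |r| r.round(2) }
end

puts "parse:     %.2fx to %.2fx" % speedup_range(PARSE_272)
puts "serialize: %.2fx to %.2fx" % speedup_range(SERIALIZE_272)
```

On these numbers the parsing gap grows with input size, while the serialization gap is largest for the smallest file.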
ruby 3.2.2 benchmark
$ ruby ./bench.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
Calculating -------------------------------------
html5 parse 656 21.030k (±18.5%) i/s - 170.856k
html4 parse 656 21.118k (±18.9%) i/s - 172.096k in 9.886192s
nokolexbor html5 parse 656
38.215k (±24.8%) i/s - 243.899k in 9.856369s
Comparison:
nokolexbor html5 parse 656: 38214.8 i/s
html4 parse 656: 21118.4 i/s - 1.81x slower
html5 parse 656: 21029.8 i/s - 1.82x slower
Calculating -------------------------------------
html5 parse 70095 275.828 (±21.0%) i/s - 2.421k in 9.996074s
html4 parse 70095 439.891 (±20.9%) i/s - 3.646k in 9.995517s
nokolexbor html5 parse 70095
1.467k (±18.5%) i/s - 13.797k in 9.983325s
Comparison:
nokolexbor html5 parse 70095: 1466.9 i/s
html4 parse 70095: 439.9 i/s - 3.33x slower
html5 parse 70095: 275.8 i/s - 5.32x slower
Calculating -------------------------------------
html5 parse 1929522 12.321 (± 8.1%) i/s - 122.000 in 10.067774s
html4 parse 1929522 36.420 (±19.2%) i/s - 351.000 in 10.018349s
nokolexbor html5 parse 1929522
146.070 (±15.1%) i/s - 1.423k in 10.001315s
Comparison:
nokolexbor html5 parse 1929522: 146.1 i/s
html4 parse 1929522: 36.4 i/s - 4.01x slower
html5 parse 1929522: 12.3 i/s - 11.86x slower
==========
Calculating -------------------------------------
html5 serlz 656 39.037k (±22.6%) i/s - 335.023k in 9.824201s
html4 serlz 656 52.522k (±21.3%) i/s - 452.027k in 9.742767s
html5 nokolexbor serlz 656
260.432k (±19.0%) i/s - 2.064M in 9.155473s
Comparison:
html5 nokolexbor serlz 656: 260432.1 i/s
html4 serlz 656: 52521.9 i/s - 4.96x slower
html5 serlz 656: 39037.3 i/s - 6.67x slower
Calculating -------------------------------------
html5 serlz 70095 950.690 (±15.6%) i/s - 9.173k in 9.989867s
html4 serlz 70095 1.049k (±15.6%) i/s - 10.090k in 9.988001s
html5 nokolexbor serlz 70095
3.464k (±16.9%) i/s - 32.979k in 9.976496s
Comparison:
html5 nokolexbor serlz 70095: 3464.2 i/s
html4 serlz 70095: 1049.5 i/s - 3.30x slower
html5 serlz 70095: 950.7 i/s - 3.64x slower
Calculating -------------------------------------
html5 serlz 1929522 114.167 (± 9.6%) i/s - 1.130k in 10.002443s
html4 serlz 1929522 112.654 (±12.4%) i/s - 1.107k in 10.006577s
html5 nokolexbor serlz 1929522
412.097 (±18.9%) i/s - 3.934k in 9.992725s
Comparison:
html5 nokolexbor serlz 1929522: 412.1 i/s
html5 serlz 1929522: 114.2 i/s - 3.61x slower
html4 serlz 1929522: 112.7 i/s - 3.66x slower
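One thing to keep in mind when reading these results: with `x.warmup = 0` the reported ± percentages are fairly large, so small differences may just be noise. For example, html4 and html5 serialization swap places on the largest input between the 2.7.2 and 3.2.2 runs. The sketch below treats the reported ± percentage as a simple error band around the mean and checks whether two bands overlap. This is a rough heuristic, not a proper significance test, and the `Measurement` struct and `overlap?` helper are hypothetical names introduced here.

```ruby
# A reported measurement: mean iterations/second and its relative error in percent.
Measurement = Struct.new(:label, :ips, :error_pct) do
  # The band mean ± error, as a Range of floats.
  def interval
    delta = ips * error_pct / 100.0
    (ips - delta)..(ips + delta)
  end
end

# True when the two error bands share at least one point.
def overlap?(a, b)
  a.interval.begin <= b.interval.end && b.interval.begin <= a.interval.end
end

# Values copied from the Ruby 3.2.2 run on the 1,929,522-character input.
html5      = Measurement.new("html5 serlz", 114.167, 9.6)
html4      = Measurement.new("html4 serlz", 112.654, 12.4)
nokolexbor = Measurement.new("html5 nokolexbor serlz", 412.097, 18.9)

puts "html5 vs html4 overlap:      #{overlap?(html5, html4)}"
puts "nokolexbor vs html5 overlap: #{overlap?(nokolexbor, html5)}"
```

By this rough measure the html4/html5 ordering on the large file is within the noise, whereas the Nokolexbor result sits well outside either band.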
@flavorjones, thank you for following up and checking Nokolexbor!
/cc @zyc9012