Process HTML using Nokolexbor instead of Nokogumbo #3043

Open
@ilyazub

Description

I used the benchmark from #2722 with Ruby 3.2.2 and 2.7.2, and added Nokolexbor to the benchmark.

In these runs, Nokolexbor parses roughly 1.8-12 times faster and serializes roughly 3-6.7 times faster than Nokogiri's Gumbo (HTML5) and libxml2 (HTML4) backends.

#! /usr/bin/env ruby
# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "nokolexbor"
  gem "benchmark-ips"
end

require "nokogiri"
require "nokolexbor"
require "benchmark/ips"

filenames = [
  "test/files/GH_1042.html", # ~650 B
  "test/files/tlm.html", # ~70 KB
  "big_shopping.html", # ~1.9 MB
]

inputs = filenames.map { |fn| File.read(fn) }

puts RUBY_DESCRIPTION

inputs.each do |input|
  len = input.length

  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10

    x.report("html5 parse #{len}") do
      Nokogiri::HTML5::Document.parse(input)
    end
    x.report("html4 parse #{len}") do
      Nokogiri::HTML4::Document.parse(input)
    end
    x.report("nokolexbor html5 parse #{len}") do
      Nokolexbor::HTML(input)
    end
    x.compare!
  end
end

puts "=========="

inputs.each do |input|
  len = input.length
  html4_doc = Nokogiri::HTML4::Document.parse(input)
  html5_doc = Nokogiri::HTML5::Document.parse(input)
  html5_doc_nokolexbor = Nokolexbor::HTML(input)

  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10

    x.report("html5 serlz #{len}") do
      html5_doc.to_html
    end
    x.report("html4 serlz #{len}") do
      html4_doc.to_html
    end
    x.report("html5 nokolexbor serlz #{len}") do
      html5_doc_nokolexbor.to_html
    end
    x.compare!
  end
end

ruby 2.7.2 benchmark

$ ruby bench.rb
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Calculating -------------------------------------
     html5 parse 656     21.049k (±23.5%) i/s -    179.547k in   9.929976s
     html4 parse 656     22.142k (±22.3%) i/s -    189.923k in   9.926466s
nokolexbor html5 parse 656
                         43.945k (±21.3%) i/s -    296.049k in   9.900173s

Comparison:
nokolexbor html5 parse 656:    43944.8 i/s
     html4 parse 656:    22141.7 i/s - 1.98x  (± 0.00) slower
     html5 parse 656:    21048.8 i/s - 2.09x  (± 0.00) slower

Calculating -------------------------------------
   html5 parse 70095    300.102  (±18.7%) i/s -      2.684k in   9.997238s
   html4 parse 70095    450.409  (±22.6%) i/s -      3.978k in   9.997504s
nokolexbor html5 parse 70095
                          1.406k (±20.4%) i/s -     13.083k in   9.984839s

Comparison:
nokolexbor html5 parse 70095:     1405.6 i/s
   html4 parse 70095:      450.4 i/s - 3.12x  (± 0.00) slower
   html5 parse 70095:      300.1 i/s - 4.68x  (± 0.00) slower

Calculating -------------------------------------
 html5 parse 1929522     13.132  (± 7.6%) i/s -    131.000  in  10.075865s
 html4 parse 1929522     37.880  (±13.2%) i/s -    370.000  in  10.017928s
nokolexbor html5 parse 1929522
                        157.773  (± 9.5%) i/s -      1.561k in   9.999853s

Comparison:
nokolexbor html5 parse 1929522:      157.8 i/s
 html4 parse 1929522:       37.9 i/s - 4.17x  (± 0.00) slower
 html5 parse 1929522:       13.1 i/s - 12.01x  (± 0.00) slower

==========
Calculating -------------------------------------
     html5 serlz 656     40.303k (±17.2%) i/s -    373.898k in   9.891472s
     html4 serlz 656     53.260k (±18.3%) i/s -    484.973k in   9.844606s
html5 nokolexbor serlz 656
                        263.888k (±15.5%) i/s -      2.270M in   9.493963s

Comparison:
html5 nokolexbor serlz 656:   263887.5 i/s
     html4 serlz 656:    53260.0 i/s - 4.95x  (± 0.00) slower
     html5 serlz 656:    40303.4 i/s - 6.55x  (± 0.00) slower

Calculating -------------------------------------
   html5 serlz 70095    918.855  (±15.1%) i/s -      8.842k in   9.993063s
   html4 serlz 70095      1.112k (±13.3%) i/s -     10.828k in   9.992264s
html5 nokolexbor serlz 70095
                          3.359k (±14.9%) i/s -     32.417k in   9.985435s

Comparison:
html5 nokolexbor serlz 70095:     3358.8 i/s
   html4 serlz 70095:     1112.0 i/s - 3.02x  (± 0.00) slower
   html5 serlz 70095:      918.9 i/s - 3.66x  (± 0.00) slower

Calculating -------------------------------------
 html5 serlz 1929522    107.234  (±12.1%) i/s -      1.055k in  10.007869s
 html4 serlz 1929522    115.701  (±11.2%) i/s -      1.140k in   9.999178s
html5 nokolexbor serlz 1929522
                        425.103  (±19.8%) i/s -      4.042k in   9.994780s

Comparison:
html5 nokolexbor serlz 1929522:      425.1 i/s
 html4 serlz 1929522:      115.7 i/s - 3.67x  (± 0.00) slower
 html5 serlz 1929522:      107.2 i/s - 3.96x  (± 0.00) slower

ruby 3.2.2 benchmark

$ ruby ./bench.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
Calculating -------------------------------------
     html5 parse 656     21.030k (±18.5%) i/s -    170.856k
     html4 parse 656     21.118k (±18.9%) i/s -    172.096k in   9.886192s
nokolexbor html5 parse 656
                         38.215k (±24.8%) i/s -    243.899k in   9.856369s

Comparison:
nokolexbor html5 parse 656:    38214.8 i/s
     html4 parse 656:    21118.4 i/s - 1.81x  slower
     html5 parse 656:    21029.8 i/s - 1.82x  slower

Calculating -------------------------------------
   html5 parse 70095    275.828  (±21.0%) i/s -      2.421k in   9.996074s
   html4 parse 70095    439.891  (±20.9%) i/s -      3.646k in   9.995517s
nokolexbor html5 parse 70095
                          1.467k (±18.5%) i/s -     13.797k in   9.983325s

Comparison:
nokolexbor html5 parse 70095:     1466.9 i/s
   html4 parse 70095:      439.9 i/s - 3.33x  slower
   html5 parse 70095:      275.8 i/s - 5.32x  slower

Calculating -------------------------------------
 html5 parse 1929522     12.321  (± 8.1%) i/s -    122.000  in  10.067774s
 html4 parse 1929522     36.420  (±19.2%) i/s -    351.000  in  10.018349s
nokolexbor html5 parse 1929522
                        146.070  (±15.1%) i/s -      1.423k in  10.001315s

Comparison:
nokolexbor html5 parse 1929522:      146.1 i/s
 html4 parse 1929522:       36.4 i/s - 4.01x  slower
 html5 parse 1929522:       12.3 i/s - 11.86x  slower

==========
Calculating -------------------------------------
     html5 serlz 656     39.037k (±22.6%) i/s -    335.023k in   9.824201s
     html4 serlz 656     52.522k (±21.3%) i/s -    452.027k in   9.742767s
html5 nokolexbor serlz 656
                        260.432k (±19.0%) i/s -      2.064M in   9.155473s

Comparison:
html5 nokolexbor serlz 656:   260432.1 i/s
     html4 serlz 656:    52521.9 i/s - 4.96x  slower
     html5 serlz 656:    39037.3 i/s - 6.67x  slower

Calculating -------------------------------------
   html5 serlz 70095    950.690  (±15.6%) i/s -      9.173k in   9.989867s
   html4 serlz 70095      1.049k (±15.6%) i/s -     10.090k in   9.988001s
html5 nokolexbor serlz 70095
                          3.464k (±16.9%) i/s -     32.979k in   9.976496s

Comparison:
html5 nokolexbor serlz 70095:     3464.2 i/s
   html4 serlz 70095:     1049.5 i/s - 3.30x  slower
   html5 serlz 70095:      950.7 i/s - 3.64x  slower

Calculating -------------------------------------
 html5 serlz 1929522    114.167  (± 9.6%) i/s -      1.130k in  10.002443s
 html4 serlz 1929522    112.654  (±12.4%) i/s -      1.107k in  10.006577s
html5 nokolexbor serlz 1929522
                        412.097  (±18.9%) i/s -      3.934k in   9.992725s

Comparison:
html5 nokolexbor serlz 1929522:      412.1 i/s
 html5 serlz 1929522:      114.2 i/s - 3.61x  slower
 html4 serlz 1929522:      112.7 i/s - 3.66x  slower
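
The overall speedup ranges can be rechecked from the "Comparison:" figures above. The following is a minimal sketch that recomputes them from the rounded i/s values, copied by hand from the output. `ROWS` and `speedup_range` are names made up for this illustration and are not part of the benchmark script. Because the inputs are rounded, the largest ratio may differ slightly from the 12.01x that benchmark-ips printed.

```ruby
# i/s figures copied from the "Comparison:" blocks above (rounded values).
# Row layout: [ruby version, operation, input size, nokolexbor, html4, html5]
ROWS = [
  ["2.7.2", :parse, 656,       43944.8, 22141.7, 21048.8],
  ["2.7.2", :parse, 70095,      1405.6,   450.4,   300.1],
  ["2.7.2", :parse, 1929522,     157.8,    37.9,    13.1],
  ["2.7.2", :serlz, 656,      263887.5, 53260.0, 40303.4],
  ["2.7.2", :serlz, 70095,      3358.8,  1112.0,   918.9],
  ["2.7.2", :serlz, 1929522,     425.1,   115.7,   107.2],
  ["3.2.2", :parse, 656,       38214.8, 21118.4, 21029.8],
  ["3.2.2", :parse, 70095,      1466.9,   439.9,   275.8],
  ["3.2.2", :parse, 1929522,     146.1,    36.4,    12.3],
  ["3.2.2", :serlz, 656,      260432.1, 52521.9, 39037.3],
  ["3.2.2", :serlz, 70095,      3464.2,  1049.5,   950.7],
  ["3.2.2", :serlz, 1929522,     412.1,   112.7,   114.2],
].freeze

# Returns [min, max] of the Nokolexbor speedup over both Nokogiri backends
# for the given operation (:parse or :serlz), across all rows.
def speedup_range(rows, operation)
  ratios = rows.select { |row| row[1] == operation }
               .flat_map do |_version, _op, _size, nokolexbor, html4, html5|
    [nokolexbor / html4, nokolexbor / html5]
  end
  ratios.minmax
end

%i[parse serlz].each do |operation|
  min, max = speedup_range(ROWS, operation)
  puts format("%-5s %.2fx .. %.2fx", operation, min, max)
end
```

With these inputs the parse range comes out at about 1.8x to 12x and the serialization range at about 3x to 6.7x. The reported error margins (±8% to ±25%) are wide, so the ratios are best read as rough magnitudes.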

@flavorjones, thank you for following up and checking Nokolexbor! ♥️ What incompatibilities did you notice, and what do you think about using the Lexbor library in Nokogiri?

/cc @zyc9012
