From 457152fa1bc29b762527677754d1f7600f8aa801 Mon Sep 17 00:00:00 2001
From: Mike Dalessio
Date: Sun, 14 Mar 2021 10:18:42 -0400
Subject: [PATCH] feat: Nokogumbo detects Nokogiri's HTML5 API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
https://github.com/sparklemotion/nokogiri/issues/2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and become a "shim" -- gracefully defer to Nokogiri by refusing to load
itself.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)
- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same constants and singleton methods as `::Nokogumbo`:
  - `Nokogumbo.parse()` will be provided as `Nokogiri::Gumbo.parse()`
  - `Nokogumbo.fragment()` → `Nokogiri::Gumbo.fragment()`
  - `Nokogumbo::DEFAULT_MAX_ATTRIBUTES` → `Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES`
  - `Nokogumbo::DEFAULT_MAX_ERRORS` → `Nokogiri::Gumbo::DEFAULT_MAX_ERRORS`
  - `Nokogumbo::DEFAULT_MAX_TREE_DEPTH` → `Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH`

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to avoid
the "shim" behavior. This escape hatch might be unnecessary, but this
change is invasive enough to make me want to be cautious.

In "shim" mode, `Nokogumbo.parse()` and `.fragment()` will be forwarded
to the Nokogiri implementation.
The `Nokogumbo::DEFAULT*` constants will always be defined, but when in
"shim" mode will be set to the `Nokogiri`-provided values.

Nokogumbo will emit a single warning message at `require`-time when it
is in "shim" mode. This message points users to
https://github.com/sparklemotion/nokogiri/issues/2205 which will
explain what's going on and help people migrate their applications (but
is an empty placeholder right now).

I did not include deprecation warning messages in `Nokogumbo.parse` and
`.fragment`. If you feel strongly that we should, let me know.
---
 lib/nokogumbo.rb | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/lib/nokogumbo.rb b/lib/nokogumbo.rb
index 262d45b9..ce1679c5 100644
--- a/lib/nokogumbo.rb
+++ b/lib/nokogumbo.rb
@@ -1,17 +1,38 @@
 require 'nokogiri'
 require 'nokogumbo/version'
-require 'nokogumbo/html5'
 
-require 'nokogumbo/nokogumbo'
+if ((defined?(Nokogiri::HTML5) && Nokogiri::HTML5.respond_to?(:parse)) &&
+    (defined?(Nokogiri::Gumbo) && Nokogiri::Gumbo.respond_to?(:parse)) &&
+    !(ENV.key?("NOKOGUMBO_IGNORE_NOKOGIRI_HTML5") && ENV["NOKOGUMBO_IGNORE_NOKOGIRI_HTML5"] != "false"))
 
-module Nokogumbo
-  # The default maximum number of attributes per element.
-  DEFAULT_MAX_ATTRIBUTES = 400
+  warn "NOTE: nokogumbo: Using Nokogiri::HTML5 provided by Nokogiri. See https://github.com/sparklemotion/nokogiri/issues/2205 for more information."
 
-  # The default maximum number of errors for parsing a document or a fragment.
-  DEFAULT_MAX_ERRORS = 0
+  module Nokogumbo
+    def self.parse(*args)
+      Nokogiri::Gumbo.parse(*args)
+    end
 
-  # The default maximum depth of the DOM tree produced by parsing a document
-  # or fragment.
-  DEFAULT_MAX_TREE_DEPTH = 400
+    def self.fragment(*args)
+      Nokogiri::Gumbo.fragment(*args)
+    end
+
+    DEFAULT_MAX_ATTRIBUTES = Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES
+    DEFAULT_MAX_ERRORS = Nokogiri::Gumbo::DEFAULT_MAX_ERRORS
+    DEFAULT_MAX_TREE_DEPTH = Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH
+  end
+else
+  require 'nokogumbo/html5'
+  require 'nokogumbo/nokogumbo'
+
+  module Nokogumbo
+    # The default maximum number of attributes per element.
+    DEFAULT_MAX_ATTRIBUTES = 400
+
+    # The default maximum number of errors for parsing a document or a fragment.
+    DEFAULT_MAX_ERRORS = 0
+
+    # The default maximum depth of the DOM tree produced by parsing a document
+    # or fragment.
+    DEFAULT_MAX_TREE_DEPTH = 400
+  end
 end
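
The escape-hatch condition in the `if` expression above is easy to misread, so here is a minimal sketch of it in isolation. The helper name `ignore_nokogiri_html5?`, the constant `IGNORE_VAR`, and the hash argument are hypothetical and exist only for illustration; the patch itself performs the check inline against `ENV`.

```ruby
# Hypothetical helper: the patch performs this check inline against ENV.
# Taking a hash-like argument keeps the sketch runnable without mutating ENV.
IGNORE_VAR = "NOKOGUMBO_IGNORE_NOKOGIRI_HTML5"

def ignore_nokogiri_html5?(env)
  # True when the variable is set to anything other than the literal
  # string "false"; in that case Nokogumbo loads its own implementation.
  env.key?(IGNORE_VAR) && env[IGNORE_VAR] != "false"
end

ignore_nokogiri_html5?({})                        # unset: shim mode is allowed
ignore_nokogiri_html5?({ IGNORE_VAR => "false" }) # literal "false": shim mode is allowed
ignore_nokogiri_html5?({ IGNORE_VAR => "1" })     # escape hatch active
ignore_nokogiri_html5?({ IGNORE_VAR => "" })      # also active: "" is not "false"
```

Only an unset variable or the exact string `false` leaves "shim" mode enabled. Any other value, including an empty string or `0`, activates the escape hatch, which may be worth calling out in the migration notes.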