Nokogiri::XML::Node#add_child_node_and_reparent_attrs behaves incorrectly if an attribute name has a colon #1790
Description
What problems are you experiencing?
Nokogiri::XML::Node#add_child_node_and_reparent_attrs uses an incorrect test to see if a node's namespaces need to be reparented (at least, I think that's the purpose of this code). It tests a.name =~ /:/ to decide whether to reparent. This breaks HTML elements, which are allowed to have colons in their attribute names. (Essentially, HTML only allows foreign elements to have explicit namespaces.)
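To make the difference between the two tests concrete, here is a minimal sketch that does not use Nokogiri at all. FakeAttr is a hypothetical stand-in for Nokogiri::XML::Attr that carries only a name and an optional namespace; it is an assumption made for illustration, not part of the library.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Attr: a name plus an optional namespace.
FakeAttr = Struct.new(:name, :namespace)

XML_NS = 'http://www.w3.org/XML/1998/namespace'

attrs = [
  FakeAttr.new('xml:lang', nil),  # HTML attribute whose name merely contains a colon
  FakeAttr.new('lang', XML_NS),   # attribute that really lives in a namespace
  FakeAttr.new('class', nil)      # ordinary attribute
]

# The test described above: anything with a colon in its name is selected.
by_colon = attrs.select { |a| a.name =~ /:/ }

# The alternative: only attributes that actually carry a namespace are selected.
by_namespace = attrs.select { |a| a.namespace }

puts by_colon.map(&:name).inspect      # ["xml:lang"]
puts by_namespace.map(&:name).inspect  # ["lang"]
```

The colon test picks up the un-namespaced xml:lang attribute and misses the one that is actually namespaced, while the namespace test does the opposite.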
It's a little convoluted to demonstrate the problem using the Nokogiri API.
require 'nokogiri'
doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'
puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)
This prints
#<Nokogiri::XML::Attr:0x3ff75a84b6d8 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3ff75a84b340 name="lang" namespace=#<Nokogiri::XML::Namespace:0x3ff75a84b2dc prefix="xml" href="http://www.w3.org/XML/1998/namespace"> value="en">
I note that using doc.root = html doesn't do this reparenting.
Here's a backwards compatible fix.
require 'nokogiri'
module Nokogiri
  module XML
    class Node
      # HTML elements can have attributes that contain colons.
      # Nokogiri::XML::Node#[]= treats names with colons as a prefixed QName
      # and tries to create an attribute in a namespace. This is especially
      # annoying with attribute names like xml:lang since libxml2 will
      # actually create the xml namespace if it doesn't exist already.
      def add_child_node_and_reparent_attrs node
        add_child_node(node)
        # Only reparent attributes that actually carry a namespace.
        node.attribute_nodes.find_all { |a| a.namespace }.each do |attr|
          attr.remove
          node[attr.name] = attr.value
        end
      end
    end
  end
end
doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'
puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)
This prints
#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">
Ideally, an API for manipulating attributes in a given namespace that doesn't do the parsing based on colon (e.g., uses xmlNewProp rather than xmlSetProp) would be fantastic.
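As a rough illustration of what such an API would change, the following toy model contrasts a setter that splits the name on a colon with one that stores the name literally and takes the namespace prefix as a separate argument. ToyElement, set_parsed and set_literal are hypothetical names invented for this sketch; they are not Nokogiri or libxml2 calls, and the model ignores everything else those libraries do.

```ruby
# Toy model of an element's attributes, keyed by [namespace_prefix, local_name].
# Hypothetical sketch only; not a real Nokogiri or libxml2 API.
class ToyElement
  attr_reader :attrs

  def initialize
    @attrs = {}
  end

  # Colon-parsing setter: "prefix:local" is split into a prefix and a local name.
  def set_parsed(name, value)
    prefix, local = name.include?(':') ? name.split(':', 2) : [nil, name]
    @attrs[[prefix, local]] = value
  end

  # Literal setter: the name is stored as given; a namespace prefix, if any,
  # must be passed explicitly.
  def set_literal(name, value, prefix: nil)
    @attrs[[prefix, name]] = value
  end
end

el = ToyElement.new
el.set_parsed('xml:lang', 'en')   # stored as prefix "xml", local name "lang"
el.set_literal('xml:lang', 'en')  # stored with no prefix, name "xml:lang"

p el.attrs.keys  # [["xml", "lang"], [nil, "xml:lang"]]
```

With the literal setter, an HTML attribute named xml:lang stays a plain attribute, and a namespaced attribute is only created when the caller asks for one.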
What's the output from nokogiri -v?
# Nokogiri (1.8.4)
---
warnings: []
nokogiri: 1.8.4
ruby:
  version: 2.4.4
  platform: x86_64-darwin17
  description: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-darwin17]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.8"
  libxslt_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.32"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  libxslt_patches: []
  compiled: 2.9.8
  loaded: 2.9.8
Can you provide a self-contained script that reproduces what you're seeing?