Nokogiri::XML::Node#add_child_node_and_reparent_attrs behaves incorrectly if an attribute name has a colon #1790

Open
@stevecheckoway

Description

What problems are you experiencing?
Nokogiri::XML::Node#add_child_node_and_reparent_attrs uses an incorrect test to see if a node's namespaces need to be reparented (at least, I think that's the purpose of this code).

It tests a.name =~ /:/ to decide whether to reparent. This breaks HTML elements, which are allowed to have colons in their attribute names. (Essentially, HTML only allows foreign elements to have explicit namespaces.)
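To illustrate the difference between the two checks, here is a minimal pure-Ruby sketch. It does not use Nokogiri; FakeAttr is a hypothetical stand-in that carries only the two fields the check looks at, and the sample attributes are made up for illustration.

```ruby
# Hypothetical stand-in for an attribute node: just a name and an
# optional namespace (nil when the attribute is in no namespace).
FakeAttr = Struct.new(:name, :namespace)

XML_NS = 'http://www.w3.org/XML/1998/namespace'

attrs = [
  FakeAttr.new('xml:lang', nil),  # HTML attribute whose name merely contains a colon
  FakeAttr.new('lang', XML_NS),   # attribute that really is in a namespace
  FakeAttr.new('class', nil)      # ordinary attribute
]

# Name-based test: selects anything with a colon in its name.
selected_by_colon = attrs.select { |a| a.name =~ /:/ }

# Namespace-based test: selects only attributes that carry a namespace.
selected_by_namespace = attrs.select { |a| a.namespace }

puts selected_by_colon.map(&:name).inspect
puts selected_by_namespace.map(&:name).inspect
```

The name-based test picks the plain HTML attribute and misses the namespaced one, while the namespace-based test does the opposite.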

It's a little convoluted to demonstrate the problem using the Nokogiri API.

require 'nokogiri'

doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'

puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)

This prints

#<Nokogiri::XML::Attr:0x3ff75a84b6d8 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3ff75a84b340 name="lang" namespace=#<Nokogiri::XML::Namespace:0x3ff75a84b2dc prefix="xml" href="http://www.w3.org/XML/1998/namespace"> value="en">

I note that using doc.root = html doesn't do this reparenting.

Here's a backwards-compatible fix.

require 'nokogiri'

module Nokogiri
  module XML
    class Node
      # HTML elements can have attributes that contain colons.
      # Nokogiri::XML::Node#[]= treats names with colons as a prefixed QName
      # and tries to create an attribute in a namespace. This is especially
      # annoying with attribute names like xml:lang since libxml2 will
      # actually create the xml namespace if it doesn't exist already.
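      def add_child_node_and_reparent_attrs(node)
        add_child_node(node)
        # Reparent only attributes that actually have a namespace, instead of
        # every attribute whose name contains a colon.
        node.attribute_nodes.find_all { |a| a.namespace }.each do |attr|
          attr.remove
          node[attr.name] = attr.value
        end
      end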
    end
  end
end

doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'

puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)

This prints

#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">

Ideally, Nokogiri would offer an API for manipulating attributes in a given namespace that doesn't parse the attribute name based on a colon (e.g., one that uses xmlNewProp rather than xmlSetProp).
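As a rough illustration of the two behaviors being contrasted, here is a pure-Ruby sketch. It does not call Nokogiri or libxml2: set_attr_parsing_colon, set_attr_literal, KNOWN_PREFIXES, and the hash used as an attribute store are all hypothetical names invented for this example, and the prefix handling is a simplification of what a QName-parsing setter does.

```ruby
# Prefix bindings assumed to be in scope; a stand-in for namespace lookup
# on a real node.
KNOWN_PREFIXES = { 'xml' => 'http://www.w3.org/XML/1998/namespace' }.freeze

# Colon-parsing setter (hypothetical): treats "prefix:local" as a QName when
# the prefix is bound and stores the value under [namespace URI, local name].
def set_attr_parsing_colon(attrs, name, value)
  prefix, local = name.split(':', 2)
  href = local && KNOWN_PREFIXES[prefix]
  key = href ? [href, local] : [nil, name]
  attrs[key] = value
end

# Literal setter (hypothetical): never inspects the name. The namespace,
# if any, is passed explicitly by the caller.
def set_attr_literal(attrs, name, value, href = nil)
  attrs[[href, name]] = value
end

parsed = {}
set_attr_parsing_colon(parsed, 'xml:lang', 'en')

literal = {}
set_attr_literal(literal, 'xml:lang', 'en')

puts parsed.keys.inspect
puts literal.keys.inspect
```

With the literal setter, an HTML attribute named xml:lang keeps its full name and stays in no namespace, which is the behavior requested here.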

What's the output from nokogiri -v?

# Nokogiri (1.8.4)
    ---
    warnings: []
    nokogiri: 1.8.4
    ruby:
      version: 2.4.4
      platform: x86_64-darwin17
      description: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-darwin17]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.8"
      libxslt_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.32"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      libxslt_patches: []
      compiled: 2.9.8
      loaded: 2.9.8

Can you provide a self-contained script that reproduces what you're seeing?
