[bug] extra </span> inserted after Nokogiri::HTML.parse #2796

seanstory · 2023-02-21T21:51:15Z

Please describe the bug

I'm attempting to parse HTML content from a site I do not control, specifically https://2e.aonprd.com.
I'm taking the raw HTML and attempting to extract the text content, excluding common header and footer text, for a search use case. My plan was to do something like:

html_content = get_html_content(url)
parsed_data = Nokogiri::HTML.parse(html_content)
text_i_care_about = parsed_data.at_css('[id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"]').text # forgive the long selector

However, I've noticed that for some pages this returns "" (the empty string).

When I go to browser dev tools and access text by the selector with $$("#ctl00_RadDrawer1_Content_MainContent_DetailedOutput").map(e=>e.textContent), I get the text I'm expecting, so it's not an issue with having gotten the selector wrong.

When I step through with irb, I can see that an extra </span> is being inserted right before the text content, so the document returned by Nokogiri::HTML.parse closes my identified element early. I'll attach the raw HTML for an example page, https://2e.aonprd.com/(X(1)S(jjv5qg45qaziuq55lopb3o45))/Classes.aspx?ID=1

html.zip

Help us reproduce what you're seeing

#! /usr/bin/env ruby

require 'nokogiri'

content = File.read('raw.html')
parsed_data = Nokogiri::HTML.parse(content)
body_content = parsed_data.at_css('[id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"]')
puts body_content.text # the empty string
puts parsed_data # scroll up, and see that right after `ctl00_RadDrawer1_Content_MainContent_DetailedOutput` there's loads of text, but a new, erroneous </span> has been added right before it

Expected behavior

Nokogiri shouldn't add extra closing tags.

Environment
OSX 13.2.1
Platform: arm64-darwin
reproduced in:

JRuby 9.3.3.0, Nokogiri 1.13.10
MRI Ruby 2.6.9, Nokogiri 1.13.4
MRI Ruby 2.7.7, Nokogiri 1.14.2

# Nokogiri (1.14.2)
    ---
    warnings: []
    nokogiri:
      version: 1.14.2
      cppflags:
      - "-I/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri"
      - "-I/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri/include"
      - "-I/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.7.7
      platform: arm64-darwin22
      gem_platform: arm64-darwin-22
      description: ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [arm64-darwin22]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.10.3
      loaded: 2.10.3
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.37
      loaded: 1.1.37
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri


flavorjones · 2023-02-22T16:18:47Z

@seanstory Sorry you're having a problem. I'll try to explain what's going on here. In summary, the HTML you're parsing is not well-formed, and so parsers will try to "fix it up".

Notably, HTML4 does not have a specification for how "fixing up" should be done, and so parsers may all do different things. But HTML5 does have a "fix up" spec, so if you want to match modern browser behavior, you should use Nokogiri::HTML5 and not Nokogiri::HTML.

Here's the start of the markup from raw.html that you're trying to operate on:

<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"><h1 class="title"><a href ="PFS.aspx"><span style="float:left;"><img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png"></a></span>Alchemist</h1>...

Let me format that better so you can see the structure more clearly:

  <html>
    <body>
      <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
        <h1 class="title">
          <a href="PFS.aspx">
            <span style="float:left;">
              <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
            </a>
          </span>
          Alchemist
        </h1>
    </body>
  </html>

You should be able to see pretty clearly that the opening and closing tags are mismatched. When the parser sees the closing </a> tag, it will auto-close any other tags that were enclosed in that a element, which includes the inner span. Later, when it sees the closing </span> tag, it matches that against the outer span and auto-closes the h1 along the way. Finally, when it sees the </h1> tag, it can't find a matching opening tag and drops it.
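
To make the auto-closing steps above concrete, here is a toy model of a stack of open elements. It's only a sketch: the `repair` method, the token format, and the `VOID_TAGS` list are made up for illustration and are not how libxml2 is actually implemented, though the model happens to produce the same structure for this input.

```ruby
# Toy model only: NOT libxml2's real algorithm, just the idea of
# "a closing tag auto-closes everything still open inside its element".
VOID_TAGS = %w[img br hr input].freeze # elements that never get a closing tag

def repair(tokens)
  stack  = []  # currently open elements, innermost last
  out    = +"" # repaired markup
  errors = []

  tokens.each do |type, value|
    case type
    when :open
      out << "<#{value}>"
      stack.push(value) unless VOID_TAGS.include?(value)
    when :text
      out << value
    when :close
      if stack.include?(value)
        # Close whatever is still open inside the matching element,
        # then close the element itself.
        loop do
          name = stack.pop
          out << "</#{name}>"
          break if name == value
        end
      else
        # No matching open element: report the stray end tag and drop it.
        errors << "Unexpected end tag : #{value}"
      end
    end
  end

  [out, errors]
end

# The mismatched markup from raw.html, reduced to bare tag names.
tokens = [
  [:open, "span"], [:open, "h1"], [:open, "a"], [:open, "span"], [:open, "img"],
  [:close, "a"], [:close, "span"], [:text, "Alchemist"], [:close, "h1"]
]

repaired, errors = repair(tokens)
puts repaired
puts errors
```

In this model the `</span>` closes the outer span (taking the h1 with it), so "Alchemist" ends up outside the element being selected, and the trailing `</h1>` is reported as an error.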

Here's some working code to demonstrate what's happening:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

html = <<~HTML
  <html>
    <body>
      <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
        <h1 class="title">
          <a href="PFS.aspx">
            <span style="float:left;">
              <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
            </a>
          </span>
          Alchemist
        </h1>
    </body>
  </html>
HTML

doc = Nokogiri::HTML4::Document.parse(html)

doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html>\n" +
#    "  <body>\n" +
#    "    <span id=\"ctl00_RadDrawer1_Content_MainContent_DetailedOutput\">\n" +
#    "      <h1 class=\"title\">\n" +
#    "        <a href=\"PFS.aspx\">\n" +
#    "          <span style=\"float:left;\">\n" +
#    "            <img alt=\"PFS Standard\" title=\"PFS Standard\" style=\"height:25px; padding:2px 10px 0px 2px\" src=\"ImagesIconsPFS_Standard.png\">\n" +
#    "          </span></a>\n" +
#    "        </h1></span>\n" +
#    "        Alchemist\n" +
#    "      \n" +
#    "  </body>\n" +
#    "</html>\n"

doc.errors
# => [#<Nokogiri::XML::SyntaxError: 11:12: ERROR: Unexpected end tag : h1>]

So the final, corrected markup will look like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
      <h1 class="title">
        <a href="PFS.aspx">
          <span style="float:left;">
            <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="ImagesIconsPFS_Standard.png">
          </span></a>
        </h1></span>
        Alchemist
      
  </body>
</html>

But note that libgumbo (Nokogiri::HTML5 on CRuby) corrects this differently, and possibly in the same way your browser fixes it up!

Here's more code demonstrating the HTML5 behavior:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

html = <<~HTML
  <html>
    <body>
      <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
        <h1 class="title">
          <a href="PFS.aspx">
            <span style="float:left;">
              <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
            </a>
          </span>
          Alchemist
        </h1>
    </body>
  </html>
HTML

doc = Nokogiri::HTML5::Document.parse(html, max_errors: 10)

doc.to_html
# => "<html><head></head><body>\n" +
#    "    <span id=\"ctl00_RadDrawer1_Content_MainContent_DetailedOutput\">\n" +
#    "      <h1 class=\"title\">\n" +
#    "        <a href=\"PFS.aspx\">\n" +
#    "          <span style=\"float:left;\">\n" +
#    "            <img alt=\"PFS Standard\" title=\"PFS Standard\" style=\"height:25px; padding:2px 10px 0px 2px\" src=\"ImagesIconsPFS_Standard.png\">\n" +
#    "          </span></a>\n" +
#    "        \n" +
#    "        Alchemist\n" +
#    "      </h1>\n" +
#    "  \n" +
#    "\n" +
#    "</span></body></html>"

doc.errors
# => [#<Nokogiri::XML::SyntaxError:"1:1: ERROR: Expected a doctype token\n<html>\n^">,
#     #<Nokogiri::XML::SyntaxError:"8:11: ERROR: That tag isn't allowed here  Currently open tags: html, body, span, h1, a, span.\n          </a>\n          ^">,
#     #<Nokogiri::XML::SyntaxError:"9:9: ERROR: That tag isn't allowed here  Currently open tags: html, body, span, h1.\n        </span>\n        ^">,
#     #<Nokogiri::XML::SyntaxError:"12:3: ERROR: That tag isn't allowed here  Currently open tags: html, body, span.\n  </body>\n  ^">]

And the parsed HTML5 DOM looks like:

<html><head></head><body>
    <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
      <h1 class="title">
        <a href="PFS.aspx">
          <span style="float:left;">
            <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="ImagesIconsPFS_Standard.png">
          </span></a>
        
        Alchemist
      </h1>
  

</span></body></html>

I hope all this makes sense! What questions do you have for me?

seanstory · 2023-02-22T20:35:34Z

@flavorjones thanks for responding so fast! This explanation makes sense, thank you so much for the help. I'm bummed that this solution isn't available for JRuby, but I see there's an open issue for that, so maybe one day. 🤞

seanstory added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Feb 21, 2023

flavorjones closed this as completed Feb 22, 2023

flavorjones added meta/user-help and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Feb 22, 2023
