Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] v1.11 - HTML parsing ignores text encoding and attempt to force it #2173

Closed
CodingAnarchy opened this issue Jan 15, 2021 · 8 comments
Closed

Comments

@CodingAnarchy
Copy link

A snippet of an HTML document with a meta-tag of <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"> is being parsed as Windows-1252, despite being handled by Ruby as a UTF-8 encoded string (already decoded by Ruby), and ignoring any attempt to pass an explicit UTF-8 encoding.

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class Test < MiniTest::Spec
  describe "Node#css" do
    it "should find a div using chained classes" do
      html = <<~HEREDOC
        <html>
          <head>
            <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
          </head>
          <body>
            <div class="foo">Here's some text.</div>
            </body>
       </html>
      HEREDOC
      
      doc = Nokogiri::HTML::Document.parse(html, nil, Encoding::UTF_8.to_s)
      
      assert_equal 1, doc.css("div.foo").length
      assert_equal "Here's some text.", doc.at_css("div.foo).text # This fails because the apostrophe is parsed as `’`
    end
  end
end

Expected behavior

Nokogiri parses this as a UTF-8 document when specified without mangling the text, as occurred in versions prior to 1.11.

Environment

# Nokogiri (1.11.1)
    ---
    warnings: []
    nokogiri:
      version: 1.11.1
    ruby:
      version: 2.5.7
      platform: java
      gem_platform: universal-java-11
      description: jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server
        VM 11.0.8+10 on 11.0.8+10 +jit [darwin-x86_64]
      engine: jruby
      jruby: 9.2.13.0
    other_libraries:
      xerces: Xerces-J 2.12.0
      nekohtml: NekoHTML 1.9.21

Additional context

@CodingAnarchy CodingAnarchy added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jan 15, 2021
@flavorjones
Copy link
Member

@CodingAnarchy Thanks for opening this issue, and sorry that you're having a problem here.

So, what's interesting is that I cannot reproduce this! Here's my setup:

~ $ nokogiri -v
# Nokogiri (1.11.1)
    ---
    warnings: []
    nokogiri:
      version: 1.11.1
    ruby:
      version: 2.5.7
      platform: java
      gem_platform: universal-java-11
      description: jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server
        VM 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 on 11.0.9.1+1-Ubuntu-0ubuntu1.20.04 [linux-x86_64]
      engine: jruby
      jruby: 9.2.13.0
    other_libraries:
      xerces: Xerces-J 2.12.0
      nekohtml: NekoHTML 1.9.21

~ $ code/oss/nokogiri/foo.rb -v
Run options: -v --seed 14338

# Running:

Node#css#test_0001_should find a div using chained classes = 0.20 s = .

Finished in 0.206927s, 4.8326 runs/s, 9.6652 assertions/s.

1 runs, 2 assertions, 0 failures, 0 errors, 0 skips

I'm in the unfortunate position of not knowing much about how JRuby handles encoding, so I'm not even sure where to look to try to diagnose why this test fails on your system but passes on mine.

@flavorjones
Copy link
Member

I've also tried reproducing this on a Mac, and still don't see this issue. Here's that config:

# Nokogiri (1.11.1)
    ---
    warnings: []
    nokogiri:
      version: 1.11.1
    ruby:
      version: 2.5.7
      platform: java
      gem_platform: universal-java-15
      description: jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server
        VM 15.0.1+9 on 15.0.1+9 [darwin-x86_64]
      engine: jruby
      jruby: 9.2.13.0
    other_libraries:
      xerces: Xerces-J 2.12.0
      nekohtml: NekoHTML 1.9.21

@flavorjones flavorjones added needs/more-info platform/jruby and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 15, 2021
@flavorjones
Copy link
Member

👋 Bumping this, are you able to help me reproduce? I'll close in a few days if I don't hear back.

@CodingAnarchy
Copy link
Author

@flavorjones I've dived into it a bit more and it turns out that we are ingesting an email that is Windows-1252 encoded, which the mail gem is decoding to UTF-8, and then we are feeding an HTML part in to nokogiri for parsing. This worked with v1.10.10, but it fails now on v1.11.1.

#! /usr/bin/env ruby

require 'nokogiri'
require 'mail'
require 'minitest/autorun'

class Test < MiniTest::Spec
  describe "Node#css" do
    it "should find a div using chained classes" do
      message = Mail.read_from_string(File.read('test.eml'))
      body = message.parts.last.decoded.to_s

      doc = Nokogiri::HTML::Document.parse(body)

      assert_equal 1, doc.css("td.foo").length

      test_output = <<~HEREDOC
      [
      "1. Make sure the machine is completely updated and all your software has the latest patch.",
      "2. Contact your incident response team. NOTE: If you don't have an incident response team, contact Microsoft Support for architectural remediation and forensic.",
      "3. Install and run Microsoft's Malicious Software Removal Tool (see https://www.microsoft.com/en-us/download/malicious-software-removal-tool-details.aspx).",
      "4. Run Microsoft's Autoruns utility and try to identify unknown applications that are configured to run at login (see https://technet.microsoft.com/en-us/sysinternals/bb963902.aspx).",
      "5. Run Process Explorer and try to identify unknown running processes (see https://technet.microsoft.com/en-us/sysinternals/bb896653.aspx)."
      ]
      HEREDOC
      test_output = test_output.chomp # Need to remove extra new line at end of string from heredoc

      assert_equal test_output, doc.at_css("td.foo").text # This fails because the apostrophe is parsed as `’`
    end
  end
end

The .eml file I used for testing can be found in this public gist.

We have had to workaround this by adding these lines after parsing and decoding the email, which seems to resolve Nokogiri's parsing of the email:

        # Remove meta tags related to encoding - these break Nokogiri v1.11 parsing
        body.gsub!(/<meta .*>/, '')

@flavorjones
Copy link
Member

flavorjones commented Jan 30, 2021

Hi @CodingAnarchy, I'm glad to hear you've narrowed down what's going on here. Unfortunately, the test case you've provided doesn't pass on CRuby because of carriage returns (\r) and the use of ' instead of in the expected value, and doesn't pass on JRuby for the reason you're demonstrating.

Can I ask you to spend a few minutes to simplify this test case by a) avoiding use of the mail gem, b) avoiding loading an external file, and c) ensure that it passes on CRuby?

@CodingAnarchy
Copy link
Author

@flavorjones Unfortunately, I've not been able to simplify this test and still have it fail in the case in question. I'm not sure what is different about the mail gem being in the loop than when I attempt to just do the encoding differences myself. (It should be as simple as a force_encoding("Windows-1252").encoding("UTF-8"), but it appears that I am missing something about this process.

I also don't follow your comment about the new lines causing the test to fail. These strings should be identical to the heredoc, and I'm not able to replicate the issue you are referring to.

@flavorjones
Copy link
Member

I also don't follow your comment about the new lines causing the test to fail.

Here's my config:

# Nokogiri (1.11.1)
    ---
    warnings: []
    nokogiri:
      version: 1.11.1
      cppflags:
      - "-I/home/flavorjones/.rvm/gems/ruby-2.7.2/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri"
      - "-I/home/flavorjones/.rvm/gems/ruby-2.7.2/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri/include"
      - "-I/home/flavorjones/.rvm/gems/ruby-2.7.2/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri/include/libxml2"
    ruby:
      version: 2.7.2
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      - 0006-htmlParseComment-treat-as-if-it-closed-the-comment.patch
      - 0007-use-new-htmlParseLookupCommentEnd-to-find-comment-en.patch
      - '0008-use-glibc-strlen.patch'
      - '0009-avoid-isnan-isinf.patch'
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.7.2/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri"
      iconv_enabled: true
      compiled: 2.9.10
      loaded: 2.9.10
    libxslt:
      source: packaged
      precompiled: true
      patches: []
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11

Here's the output from the script you provided run against the email you linked:

Run options: --seed 16388

# Running:

F

Finished in 0.227378s, 4.3980 runs/s, 8.7959 assertions/s.

  1) Failure:
Node#css#test_0001_should find a div using chained classes [./foo.rb:29]:
--- expected
+++ actual
@@ -1,7 +1,7 @@
-"[
-\"1. Make sure the machine is completely updated and all your software has the latest patch.\",
-\"2. Contact your incident response team. NOTE: If you don't have an incident response team, contact Microsoft Support for architectural remediation and forensic.\",
-\"3. Install and run Microsoft's Malicious Software Removal Tool (see https://www.microsoft.com/en-us/download/malicious-software-removal-tool-details.aspx).\",
-\"4. Run Microsoft's Autoruns utility and try to identify unknown applications that are configured to run at login (see https://technet.microsoft.com/en-us/sysinternals/bb963902.aspx).\",
-\"5. Run Process Explorer and try to identify unknown running processes (see https://technet.microsoft.com/en-us/sysinternals/bb896653.aspx).\"
+"[\r
+\"1. Make sure the machine is completely updated and all your software has the latest patch.\",\r
+\"2. Contact your incident response team. NOTE: If you don’t have an incident response team, contact Microsoft Support for architectural remediation and forensic.\",\r
+\"3. Install and run Microsoft’s Malicious Software Removal Tool (see https://www.microsoft.com/en-us/download/malicious-software-removal-tool-details.aspx).\",\r
+\"4. Run Microsoft’s Autoruns utility and try to identify unknown applications that are configured to run at login (see https://technet.microsoft.com/en-us/sysinternals/bb963902.aspx).\",\r
+\"5. Run Process Explorer and try to identify unknown running processes (see https://technet.microsoft.com/en-us/sysinternals/bb896653.aspx).\"\r
 ]"


1 runs, 2 assertions, 1 failures, 0 errors, 0 skips

You can see that the quote character is not right, and there is a \r at the end of each line.

@flavorjones
Copy link
Member

Closing, please let me know if you're able to reduce the test case to something simpler that will allow me to reproduce what you're seeing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants