[bug] v1.11 - HTML parsing ignores text encoding and attempt to force it #2173
Comments
@CodingAnarchy Thanks for opening this issue, and sorry that you're having a problem here. So, what's interesting is that I cannot reproduce this! Here's my setup:
I'm in the unfortunate position of not knowing much about how JRuby handles encoding, so I'm not even sure where to look to try to diagnose why this test fails on your system but passes on mine.
I've also tried reproducing this on a Mac, and still don't see this issue. Here's that config:
👋 Bumping this, are you able to help me reproduce? I'll close in a few days if I don't hear back.
@flavorjones I've dived into it a bit more and it turns out that we are ingesting an email that is Windows-1252 encoded, which the
#! /usr/bin/env ruby
require 'nokogiri'
require 'mail'
require 'minitest/autorun'
class Test < MiniTest::Spec
  describe "Node#css" do
    it "should find a div using chained classes" do
      message = Mail.read_from_string(File.read('test.eml'))
      body = message.parts.last.decoded.to_s
      doc = Nokogiri::HTML::Document.parse(body)
      assert_equal 1, doc.css("td.foo").length
      test_output = <<~HEREDOC
        [
        "1. Make sure the machine is completely updated and all your software has the latest patch.",
        "2. Contact your incident response team. NOTE: If you don't have an incident response team, contact Microsoft Support for architectural remediation and forensic.",
        "3. Install and run Microsoft's Malicious Software Removal Tool (see https://www.microsoft.com/en-us/download/malicious-software-removal-tool-details.aspx).",
        "4. Run Microsoft's Autoruns utility and try to identify unknown applications that are configured to run at login (see https://technet.microsoft.com/en-us/sysinternals/bb963902.aspx).",
        "5. Run Process Explorer and try to identify unknown running processes (see https://technet.microsoft.com/en-us/sysinternals/bb896653.aspx)."
        ]
      HEREDOC
      test_output = test_output.chomp # Need to remove extra new line at end of string from heredoc
      assert_equal test_output, doc.at_css("td.foo").text # This fails because the apostrophe is parsed as `’`
    end
  end
end
The We have had to work around this by adding these lines after parsing and decoding the email, which seems to resolve Nokogiri's parsing of the email:
# Remove meta tags related to encoding - these break Nokogiri v1.11 parsing
body.gsub!(/<meta .*>/, '')
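A note on the `gsub!` workaround above: `.*` is greedy, so if the HTML arrives on a single line it can remove everything from the first `<meta ` to the last `>` on that line. Below is a narrower sketch that only drops `<meta>` tags declaring a charset. The helper name `strip_charset_meta` and the sample markup are hypothetical, not part of Nokogiri or Mail, and the sketch assumes the body has already been decoded to a valid UTF-8 string.

```ruby
# Hypothetical helper (not part of Nokogiri or Mail): remove only <meta> tags
# that declare a charset, and leave every other <meta> tag in place.
CHARSET_META = /<meta\b[^>]*\bcharset\s*=[^>]*>/i

def strip_charset_meta(html)
  # [^>]* cannot cross a closing bracket, so each match stays inside one tag.
  html.gsub(CHARSET_META, "")
end

# Illustrative input only; not taken from the reporter's email.
html = '<head><meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">' \
       '<meta name="viewport" content="width=device-width"></head>'

puts strip_charset_meta(html)
```

Whether removing the tag is the right long-term fix is a separate question; this only limits how much the workaround deletes.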
Hi @CodingAnarchy, I'm glad to hear you've narrowed down what's going on here. Unfortunately, the test case you've provided doesn't pass on CRuby because of carriage returns ( Can I ask you to spend a few minutes to simplify this test case by a) avoiding use of the
@flavorjones Unfortunately, I've not been able to simplify this test and still have it fail in the case in question. I'm not sure what is different about the I also don't follow your comment about the new lines causing the test to fail. These strings should be identical to the heredoc, and I'm not able to replicate the issue you are referring to.
Here's my config:
Here's the output from the script you provided run against the email you linked:
You can see that the quote character is not right, and there is a
Closing, please let me know if you're able to reduce the test case to something simpler that will allow me to reproduce what you're seeing.
A snippet of an HTML document with a meta-tag of
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
is being parsed as Windows-1252, despite being handled by Ruby as a UTF-8 encoded string (already decoded by Ruby), and ignoring any attempt to pass an explicit UTF-8 encoding.
Expected behavior
When UTF-8 is specified, Nokogiri parses this as a UTF-8 document without mangling the text, as it did in versions prior to 1.11.
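For readers unfamiliar with this kind of mangling, here is a minimal sketch in plain Ruby, with no Nokogiri involved, of what happens when the UTF-8 bytes of a curly apostrophe are read as Windows-1252. It does not show where in the parsing stack the misreading occurs, which this report does not establish. The sample word is made up for illustration.

```ruby
# U+2019 (right single quotation mark) is three bytes in UTF-8: E2 80 99.
utf8 = "don\u2019t"

# Reinterpret the same bytes as Windows-1252, then transcode to UTF-8 so the
# result is printable. Each of the three bytes becomes its own character.
misread = utf8.dup.force_encoding("Windows-1252").encode("UTF-8")

puts utf8.bytes.map { |b| format("%02X", b) }.join(" ")
puts misread
```

The apostrophe turns into three separate characters, which is the typical signature of UTF-8 text decoded with a single-byte Windows code page.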
Environment
Additional context