Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] jruby nokogiri + bson + boilerpipe garbles html #2550

Closed
seanstory opened this issue May 13, 2022 · 5 comments
Closed

[bug] jruby nokogiri + bson + boilerpipe garbles html #2550

seanstory opened this issue May 13, 2022 · 5 comments

Comments

@seanstory
Copy link

seanstory commented May 13, 2022

Please describe the bug

We seem to have hit an extreme edge case. Consider this script:

require 'nokogiri'
html = <<-HTML
<body>
  <div>
    <div>
      <ul class='outer'>
        <li class='outer1'>
          <div>
            <ul class='inner1'><li class='inner11'></li></ul>
          </div>
        </li>
        <li class='outer2'>
          <div>
            <ul class='inner2'><li class='inner22'></li></ul>
          </div>
        </li>
      </ul>
      <div>
        Some text 
      </div>
    </div>
    Other text
  </div>
</body>
HTML

doc = Nokogiri::HTML(html)
puts doc.to_xhtml

The output we expect (whitespace edited for clarity):

<html>
  <head></head>
  <body>
    <div>
      <div>
        <ul class="outer">
          <li class="outer1">
            <div>
              <ul class="inner1"><li class="inner11"></li></ul>
            </div>
          </li>
          <li class="outer2">
            <div>
              <ul class="inner2"><li class="inner22"></li></ul>
            </div>
          </li>
        </ul>
        <div>
          Some text 
        </div>
      </div>
      Other text
    </div>
  </body>
</html>

when using MRI Ruby, this works as expected. When using JRuby, this works as expected. But if you make two small changes:

  • add a require 'bson' to the top of the script
  • add a Jars.lock file including de.l3s.boilerpipe:boilerpipe:1.1.0:compile:

the output becomes malformed (whitespace edited for clarity):

<html>
  <head>
</head>
  <body>
    <div>
      <div>
        <ul class="outer">
          <li class="outer1">
            <div>
              <ul class="inner1"></ul>
            </div>
          </li>
          <li class="inner11"></li>
        </ul>
      </div>
      <li class="outer2">
        <div>
          <ul class="inner2"></ul>
        </div>
      </li>
      <li class="inner22"></li>
   </div>
   <div>
     Some text 
   </div>
   Other text
  </body>
</html>
  • outer now only wraps outer1, but not outer2
  • inner11 and innter22 have escaped their wrapping divs
  • Some text and Other text are now outside of the top-level div

Help us reproduce what you're seeing

Since this is such an odd edge case of the environment, a simple script will not reproduce. However, I've put together a minimal repository that illustrates the issue: https://github.com/seanstory/crawler-html-parse-bug

Expected behavior

I'd expect that adding require 'bson' somewhere in my code shouldn't change the behavior of Nokogiri. I'd also expect that the random presence of the boilerpipe-1.1.0.jar in the load path wouldn't change the behavior of Nokogiri.

However if this is expected behavior, I'd love to know of any workaround we can use to prevent this odd behavior from surfacing, as we can't easily remove either of these triggering conditions from our codebase.

Environment

$ nokogiri -v
# Nokogiri (1.13.6)
    ---
    warnings: []
    nokogiri:
      version: 1.13.6
    ruby:
      version: 2.6.8
      platform: java
      gem_platform: universal-java-11
      description: jruby 9.3.3.0 (2.6.8) 2022-01-19 b26de1f5c5 OpenJDK 64-Bit Server VM
        11.0.14+9 on 11.0.14+9 +jit [darwin-x86_64]
      engine: jruby
      jruby: 9.3.3.0
    other_libraries:
      xerces: Xerces-J 2.12.2
      nekohtml: NekoHTML 1.9.22.noko2

Additional context

I'm not sure if it's expected behavior, but it did surprise us to find that the require 'bson' caused BSON::Object to become an ancestor of Nokogiri::HTML4::Document:

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri::HTML4::Document.ancestors
=> [Nokogiri::HTML4::Document, Nokogiri::XML::Document, Nokogiri::XML::Node, Enumerable, Nokogiri::ClassResolver, Nokogiri::XML::Searchable, Nokogiri::XML::PP::Node, Object, Kernel, BasicObject]
irb(main):003:0> require 'bson'
=> true
irb(main):004:0> Nokogiri::HTML4::Document.ancestors
=> [Nokogiri::HTML4::Document, Nokogiri::XML::Document, Nokogiri::XML::Node, Enumerable, Nokogiri::ClassResolver, Nokogiri::XML::Searchable, Nokogiri::XML::PP::Node, Object, BSON::Object, JSON::Ext::Generator::GeneratorMethods::Object, Kernel, BasicObject]
@seanstory seanstory added the state/needs-triage Inbox for non-installation-related bug reports or help requests label May 13, 2022
@flavorjones
Copy link
Member

Hi @seanstory, thanks for opening this issue. I agree things like this shouldn't happen.

I'm curious why you're filing a bug with Nokogiri and not with the BSON gem? Since arguably Nokogiri's only fault here is ... not defending itself properly? Honestly I don't know, but it probably makes more sense to start with BSON since they're the ones injecting behavior into Object:

$ gem install bson
Fetching bson-4.15.0.gem
Building native extensions. This could take a while...
Successfully installed bson-4.15.0
1 gem installed

$ irb
>> require 'bson'
=> true
>> Object.ancestors
=> 
[Object,                                                                                 
 BSON::Object,                                                                           
 JSON::Ext::Generator::GeneratorMethods::Object,                                         
 PP::ObjectMixin,                                                                        
 Kernel,                                                                                 
 BasicObject]                                                                            
>> 

I'll be honest and say I'm not sure I'm equipped to try to debug this. I don't know what BSON is, and I don't know what a Jars.lock file is. If I have time I'll dig into it.

You may want to test if this is still happening against Nokogiri main which has overhauled how we load the java libraries in JRuby. That approach, using jar-dependencies, will be in Nokogiri 1.14 when it drops (soonish).

@flavorjones flavorjones added platform/jruby needs/research help wanted and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels May 14, 2022
@flavorjones
Copy link
Member

It looks like boilerpipe provides its own version of nekohtml which likely conflicts with Nokogiri's version of nekohtml:

https://github.com/kohlschutter/boilerpipe

I'm very curious if this can be reproduced with Nokogiri main because we're moved away from nekohtml to htmlunit-neko which is (after many years) only a cousin of nekohtml.

@flavorjones
Copy link
Member

When I run your crawler-html-parse-bug against Nokogiri main (v1.14.0.dev) the error does not reproduce.

If you reverse the order of require to load nokogiri before bson, this also does not reproduce, which may be a workaround for you until v1.14.0 is shipped.

What else can I help with?

@seanstory
Copy link
Author

@flavorjones amazing! Thank you so much for investigating so quickly. The workaround of switching the imports is fantastic, and we'll make a note to upgrade to 1.14.0 when it releases, as a longer-term solution.

I'm curious why you're filing a bug with Nokogiri and not with the BSON gem? Since arguably Nokogiri's only fault here is ... not defending itself properly?

Yeah, that's fair. I debated where to file, and landed on here, because you'd have the better appreciation for expected vs actual behavior. But you really came through for me, and I super appreciate it! 🎉

@flavorjones
Copy link
Member

you really came through for me, and I super appreciate it!

🥳 Happy to help! I'm going to close this and tag it with v1.14.0. Still hoping to get that release out in the next week!

@flavorjones flavorjones added this to the v1.14.0 milestone May 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants