[bug] jruby nokogiri + bson + boilerpipe garbles html #2550
Comments
Hi @seanstory, thanks for opening this issue. I agree things like this shouldn't happen. I'm curious why you're filing a bug with Nokogiri and not with the BSON gem? Since arguably Nokogiri's only fault here is ... not defending itself properly? Honestly I don't know, but it probably makes more sense to start with BSON since they're the ones injecting behavior into
I'll be honest and say I'm not sure I'm equipped to try to debug this. I don't know what BSON is, and I don't know what a You may want to test if this is still happening against Nokogiri
It looks like boilerpipe provides its own version of nekohtml, which likely conflicts with Nokogiri's version of nekohtml: https://github.com/kohlschutter/boilerpipe I'm very curious if this can be reproduced with Nokogiri
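One way to look for this kind of conflict is to check whether the same artifact shows up more than once among the jars on the load path. The following is only a rough sketch: `duplicate_artifacts` is a hypothetical helper, the paths are made up for illustration, and it assumes jar file names follow an `artifact-version.jar` pattern, which will not hold for every jar.

```ruby
# Group jar paths by artifact name (file name minus a trailing "-<version>")
# and keep only the artifacts that appear more than once.
# Hypothetical helper, not part of Nokogiri, JRuby, or boilerpipe.
def duplicate_artifacts(jar_paths)
  jar_paths
    .group_by { |path| File.basename(path, ".jar").sub(/-\d[\w.]*\z/, "") }
    .select { |_artifact, paths| paths.size > 1 }
end

# Made-up paths for illustration only; real locations and versions will differ.
example_paths = [
  "/gems/nokogiri/lib/nekohtml.jar",
  "/jars/nekohtml/nekohtml-1.9.13.jar",
  "/jars/boilerpipe/boilerpipe-1.1.0.jar"
]

duplicates = duplicate_artifacts(example_paths)
duplicates.each do |artifact, paths|
  puts "#{artifact} appears #{paths.size} times:"
  paths.each { |path| puts "  #{path}" }
end
```

A match here only says that two jars share a name; it does not show which copy the JVM actually loads, so it is a starting point for investigation rather than a diagnosis.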
When I run your If you reverse the order of What else can I help with?
@flavorjones amazing! Thank you so much for investigating so quickly. The workaround of switching the imports is fantastic, and we'll make a note to upgrade to 1.14.0 when it releases, as a longer-term solution.
Yeah, that's fair. I debated where to file, and landed on here, because you'd have the better appreciation for expected vs actual behavior. But you really came through for me, and I super appreciate it! 🎉
🥳 Happy to help! I'm going to close this and tag it with v1.14.0. Still hoping to get that release out in the next week!
Please describe the bug
We seem to have hit an extreme edge case. Consider this script:
The output we expect (whitespace edited for clarity):
When using MRI Ruby, this works as expected. When using JRuby, this also works as expected. But if you make two small changes:
- add `require 'bson'` to the top of the script
- add a `Jars.lock` file including `de.l3s.boilerpipe:boilerpipe:1.1.0:compile:`
the output becomes malformed (whitespace edited for clarity):
- `outer` now only wraps `outer1`, but not `outer2`
- `inner11` and `inner22` have escaped their wrapping divs
- `Some text` and `Other text` are now outside of the top-level div

Help us reproduce what you're seeing
Since this is such an odd edge case of the environment, a simple script will not reproduce. However, I've put together a minimal repository that illustrates the issue: https://github.com/seanstory/crawler-html-parse-bug
Expected behavior
I'd expect that adding `require 'bson'` somewhere in my code shouldn't change the behavior of Nokogiri. I'd also expect that the random presence of the `boilerpipe-1.1.0.jar` in the load path wouldn't change the behavior of Nokogiri.

However, if this is expected behavior, I'd love to know of any workaround we can use to prevent this odd behavior from surfacing, as we can't easily remove either of these triggering conditions from our codebase.
Environment
Additional context
I'm not sure if it's expected behavior, but it did surprise us to find that the `require 'bson'` caused `BSON::Object` to become an ancestor of `Nokogiri::HTML4::Document`:
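For anyone wondering how a `require` can change another library's ancestor chain: in Ruby, including a module into `Object` adds it to the ancestors of every class that descends from `Object`, including classes defined earlier. The sketch below shows the general mechanism in plain Ruby without loading either gem. `InjectedBehavior` and `UnrelatedDocument` are hypothetical stand-ins, and it is an assumption that this mirrors what the BSON gem does.

```ruby
# Hypothetical stand-in for a module that a library mixes into ::Object.
module InjectedBehavior
  def injected?
    true
  end
end

# Hypothetical stand-in for a class defined by an unrelated library.
class UnrelatedDocument; end

# Before the include, the module is not in the ancestor chain.
before = UnrelatedDocument.ancestors.include?(InjectedBehavior)

# Including into Object affects every class that inherits from it,
# even classes that were defined before this line ran.
Object.include(InjectedBehavior)

after = UnrelatedDocument.ancestors.include?(InjectedBehavior)

puts "before: #{before}, after: #{after}"
```

To check a real environment, the same `ancestors.include?` test can be run against the class in question before and after the suspect `require`.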