JRuby Nokogiri raises ConcurrentModificationException for some HTML #1520

Closed
tehsven opened this issue Jul 29, 2016 · 2 comments

tehsven commented Jul 29, 2016

The LostText class from cyberneko throws a ConcurrentModificationException when parsing some HTML. See the output below for environment details.

Removing any part of the HTML in the sample test case makes parsing succeed, including removing the x inside <span> or removing the comment.

I expect this HTML to parse without raising a ConcurrentModificationException.

Sample test case:

require 'nokogiri'

puts RUBY_DESCRIPTION
puts "Nokigiri version: #{Gem.loaded_specs['nokogiri'].version}"

html = <<-EOF
  <tr></tr>
  <span>x</span><!-- -->
EOF

begin
  result = Nokogiri::HTML(html)
  puts "SUCCESS"
rescue Exception => e
  puts "ERROR: #{e.class}: #{e.message}\n#{e.backtrace.map { |x| "\t#{x}" }.join("\n")}"
  exit(1)
end

Output:

$ ruby nokogiri_concurrent_mod.rb
Ignoring jruby-launcher-1.1.1-java because its extensions are not built.  Try: gem pristine jruby-launcher --version 1.1.1
jruby 1.7.23 (1.9.3p551) 2015-11-24 f496dd5 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 +jit [darwin-x86_64]
Nokigiri version: 1.6.8
ERROR: Java::JavaUtil::ConcurrentModificationException:
    java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
    java.util.ArrayList$Itr.next(ArrayList.java:851)
    org.cyberneko.html.LostText.refeed(LostText.java:69)
    org.cyberneko.html.HTMLTagBalancer.consumeEarlyTextIfNeeded(HTMLTagBalancer.java:545)
    org.cyberneko.html.HTMLTagBalancer.comment(HTMLTagBalancer.java:534)
    org.cyberneko.html.HTMLScanner$ContentScanner.scanComment(HTMLScanner.java:2482)
    org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2074)
    org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
    org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    nokogiri.internals.NokogiriDomParser.parse(NokogiriDomParser.java:94)
    nokogiri.internals.XmlDomParserContext.do_parse(XmlDomParserContext.java:248)
    nokogiri.internals.XmlDomParserContext.parse(XmlDomParserContext.java:234)
    nokogiri.HtmlDocument.do_parse(HtmlDocument.java:119)
    nokogiri.HtmlDocument.read_memory(HtmlDocument.java:187)
    nokogiri.HtmlDocument$INVOKER$s$0$0$read_memory.call(HtmlDocument$INVOKER$s$0$0$read_memory.gen)
    org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:296)
    org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:72)
    org.jruby.ast.FCallManyArgsNode.interpret(FCallManyArgsNode.java:60)
    org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
    org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
    org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
    org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:112)
    org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:169)
    org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:286)
    org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:81)
    org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:85)
    org.jruby.ast.CallManyArgsBlockPassNode.interpret(CallManyArgsBlockPassNode.java:57)
    org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
    org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
    org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:182)
    org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:203)
    org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)
    org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)
    nokogiri_concurrent_mod.chained_0_rescue_1$RUBY$SYNTHETIC__file__(nokogiri_concurrent_mod.rb:12)
    nokogiri_concurrent_mod.__file__(nokogiri_concurrent_mod.rb:11)
    nokogiri_concurrent_mod.load(nokogiri_concurrent_mod.rb)
    org.jruby.Ruby.runScript(Ruby.java:857)
    org.jruby.Ruby.runScript(Ruby.java:850)
    org.jruby.Ruby.runNormally(Ruby.java:729)
    org.jruby.Ruby.runFromMain(Ruby.java:578)
    org.jruby.Main.doRunFromMain(Main.java:393)
    org.jruby.Main.internalRun(Main.java:288)
    org.jruby.Main.run(Main.java:217)
    org.jruby.Main.main(Main.java:197)
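
The top two frames of the trace come from the fail-fast check in java.util.ArrayList's iterator, which raises when the list is structurally changed after iteration has started, even on a single thread. The sketch below is a plain-Ruby analogy of that mechanism only. `FailFastList` and `ConcurrentModificationError` are hypothetical names invented for illustration; this is not NekoHTML or JDK code, and the thread does not confirm what exactly modifies the list inside `LostText.refeed`.

```ruby
# Plain-Ruby analogy of a fail-fast iterator (hypothetical names throughout).
class ConcurrentModificationError < StandardError; end

class FailFastList
  def initialize(items = [])
    @items = items.dup
    @mod_count = 0 # bumped on every structural change
  end

  def add(item)
    @items << item
    @mod_count += 1
    self
  end

  def size
    @items.size
  end

  # Yields each element. Before every step, checks that the list has not
  # changed since iteration began (the role checkForComodification plays).
  def each
    expected = @mod_count
    index = 0
    while index < @items.size
      if @mod_count != expected
        raise ConcurrentModificationError, 'list changed during iteration'
      end
      yield @items[index]
      index += 1
    end
    self
  end
end

list = FailFastList.new(%w[a b])
begin
  # Mutating the list from inside its own iteration trips the check.
  list.each { |item| list.add(item.upcase) }
rescue ConcurrentModificationError => e
  puts "#{e.class}: #{e.message}"
end
```

The point of the analogy is that no second thread is needed: a callback that appends to the list being iterated is enough to trigger this kind of error.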

tehsven changed the title from "JRuby Nokogiri raises ConcurrentModificationException for valid HTML" to "JRuby Nokogiri raises ConcurrentModificationException for some HTML" on Jul 30, 2016

flavorjones (Member) commented

That's crazy. Almost definitely an upstream bug in NekoHTML. I've never reported anything upstream to that project; would you be willing to reach out to them?

flavorjones (Member) commented

This was fixed in v1.8.3; I suspect some of the great work @kares did in that release addressed it.
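
For anyone who wants to check whether an installed version is past the fix, here is a minimal sketch. It assumes only what the comment above states, that the fix shipped in 1.8.3. The helper name `nokogiri_has_fix?` and the constant `FIXED_IN` are hypothetical, introduced for illustration.

```ruby
require 'rubygems'

# Release named in this thread as containing the fix (assumption taken
# from the comment above, not independently verified here).
FIXED_IN = Gem::Version.new('1.8.3')

# Hypothetical helper: true when the given version is at or past FIXED_IN.
def nokogiri_has_fix?(version_string)
  Gem::Version.new(version_string) >= FIXED_IN
end

puts nokogiri_has_fix?('1.6.8') # version from the original report
puts nokogiri_has_fix?('1.8.3')
```

In a real script the argument would likely be `Nokogiri::VERSION` rather than a literal string.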
