-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JRuby Nokogiri raises StackOverflowError when parsing some pages #1501
Comments
This appears to be a bug in the upstream NekoHTML parser that Nokogiri uses. Would you be interested in reporting this upstream to that project? |
Hey @flavorjones! Sorry for the late reply. I'm happy to also report this upstream. Is https://sourceforge.net/p/nekohtml/bugs/ the correct place to report to? Any tips for getting a smaller test case that is less JRuby/Nokogiri specific, or do you think this stack as is will be useful to the upstream project? |
Oooh, I continue to be later than you with my replies. :-\ When I stop catching this exception, the error becomes more clear:
Let me spend a few minutes trying to build a repro in Java. |
Welp, when I use the nekohtml HTML sample program to parse this, there's no error, which tells me it has something to do with how we've configured the parser. Here's that code: package sample;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
public class TestHTMLDOM {
public static void main(String[] argv) throws Exception {
DOMParser parser = new DOMParser();
for (int i = 0; i < argv.length; i++) {
parser.parse(argv[i]);
print(parser.getDocument(), "");
}
}
public static void print(Node node, String indent) {
System.out.println(indent+node.getClass().getName());
Node child = node.getFirstChild();
while (child != null) {
print(child, indent+" ");
child = child.getNextSibling();
}
}
} |
OK, I tried configuring the Does anybody have time and bandwidth to look into trying to repro this in Java and filing a bug upstream? |
I spent a few hours trying to debug this. I think it is this block of code that is causing the stack overflow. It is replacing the elements used by the HTMLTagBalancer which causes it's internal stack of visited elements to be wrong, eventually leading to the infinite loop. I'm still trying to understand what's going on but thought i would share what I found so far in case someone else is looking into this. |
the patch accidentally removed the parents of the TR element. This caused any document fragment with a dangling (i.e. with no parent) TD or TR element to cause a stack overflow fixes #1501
pushed a fix in #1743. The parsed document is different on JRuby and MRI. Not sure if that's something we want to try to fix or just treat it as a Xerces/libxml expected difference. I would also like some ideas on how to test it. |
this is an ugly change whose only purpose is to mask the difference between libxml and nekohtml. we agreed to stop doing that a while ago and just accept that different libraries will behave different. furthermore, it caused a stack overflow while parding documents with a TD element that doesn't have any parents in #1501 fixes #1501
Thank you for working on this @jvshahid! |
Should be fixed in the next release. |
I have found at least one page that parses without error using Nokogiri 1.6.8 on Ruby 2.2.4 but raises a StackOverflowError on JRuby 1.7.25 and 9.1.2.0.
I expected Nokogiri to parse this file without raising any errors. Or at the very least raise errors consistently between the different Ruby versions.
Here is some Ruby code to reproduce the behavior:
Here's an example session running it on all three Ruby versions:
The text was updated successfully, but these errors were encountered: