Creating a DocumentFragment results in a slow memory leak #20
@rubys any chance of getting this fixed? I'm still experiencing this, and since I have a lot of background jobs that process HTML, this issue eats up all available memory. The only way to recover is to restart Sidekiq.
I don't know how to proceed with this. Let me describe what is happening in `parse` (which people should be able to follow, even if they don't know C): see nokogumbo/ext/nokogumboc/nokogumbo.c, line 186, at commit d56f954.
Possibilities include: the gumbo parser is what is leaking (i.e., the call to `gumbo_parse` allocates memory that is never fully freed).
One additional comment: there is only one call to `gumbo_destroy_output` in the code.
I spent a few hours looking into this and I'm afraid I don't know where the memory leak is. Before starting my investigation, I could think of a handful of possible causes.
First, I am able to reproduce the slow memory leak using @rgrove's code (thanks for that!). It starts off taking about 55 MB of RAM on my machine and slowly increases. By contrast, vanilla Nokogiri's memory use holds steady.

Second, I used Ruby's object-inspection facilities to check that Ruby objects themselves aren't accumulating, which suggests the leak lives below the Ruby heap (a sketch of that kind of check appears after this comment).

Third, Gumbo's own allocations don't appear to be the problem.

Fourth, it's unlikely that Nokogumbo itself is leaking (apart from possibly misusing the libxml2 API). As @rubys points out, there's a single call to `gumbo_destroy_output`.

So where does this leave us? It's possible there's a misuse of the libxml2 API. I don't know what the object ownership model of libxml2 is. I assume that the document owns all of its nodes, but the individual function documentation isn't great. Nokogumbo is pretty simple and its use of the libxml2 API is also pretty simple: it creates a document, it creates an "internal subset" (which I think is just a DTD), and it creates some nodes and properties. Its libxml2 usage boils down to a handful of document-, node-, and property-creation functions.
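(The exact check used above didn't survive the page capture; the following is one way to do that kind of Ruby-heap check, assuming an install where `Nokogiri::HTML5.fragment` is available. The object types watched and the iteration count are arbitrary choices.)

```ruby
require 'nokogumbo'

html = '<p>hello world</p>' * 1_000

GC.start
before = ObjectSpace.count_objects.dup

10_000.times { Nokogiri::HTML5.fragment(html) }

GC.start
after = ObjectSpace.count_objects

# If these deltas stay near zero while the process RSS keeps climbing,
# the leak is in native (C) memory, not in reachable Ruby objects.
%i[T_DATA T_OBJECT T_STRING].each do |type|
  puts "#{type}: #{after[type] - before[type]}"
end
```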
I checked that each of those calls is paired with the cleanup you'd expect, and the results of that audit didn't turn up anything obviously wrong. Oh. Having done all of this and written it all up, I've found where the leaks occur, using a native leak checker against the repro script.
I have no idea why that would be. Well, now I have even less idea what's going on after rerunning the check.
Oh. If Nokogumbo was compiled without the libxml2 headers, then it doesn't call libxml2 directly at all; it defines Ruby-calling stand-ins such as this one:

```c
/* Fallback used when building without the libxml2 headers: create the
   document through Nokogiri's Ruby API instead of libxml2 directly. */
static VALUE xmlNewDoc(char* version) {
  VALUE doc = rb_funcall(Document, new, 0);
  /* Document.new attaches a default internal subset (DTD); detach it
     by calling #remove on it from Ruby. */
  rb_funcall(rb_funcall(doc, internal_subset, 0), remove_, 0);
  return doc;
}
```

I bet that's part of what is causing the leak! I have no idea why it would have been built that way. I don't know why it's using more memory now that I'm not building it that way. I'll have to investigate later; I don't have any more time to spend right now.
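If that hunch is right, the leak should be reproducible without Gumbo at all. A minimal sketch, assuming the fallback above is the code path in play: do at the Ruby level exactly what the fallback does on every parse, and watch whether memory grows.

```ruby
require 'nokogiri'

# Mirror the fallback xmlNewDoc above: Document.new, then remove the
# default internal subset. If RSS grows without bound here, the leak is
# in Nokogiri's handling of removed DTD nodes, independent of Gumbo.
100_000.times do
  doc = Nokogiri::HTML::Document.new
  dtd = doc.internal_subset
  dtd.remove if dtd
end
```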
At least one leak is a bug in Nokogiri. I'm not sure whether there's a workaround. And I appear to have found a second leak, but I haven't tracked down its source, and I really do need to stop working on this now.
If you think there's a memory leak in Nokogiri, kindly report it with as much information as you have and we'll look into it.
@flavorjones I did here.
Ah, I see you've reported it at sparklemotion/nokogiri#1784; I'll take a look as soon as I can (on vacation at the moment).
I'm pretty sure that I've worked out where the bug in Nokogiri is. I left a detailed comment here. That said, it should only be an issue if you're compiling Nokogumbo without the libxml2 headers. The 2.0.0 alpha version that's currently on RubyGems tries pretty hard not to do that unless you explicitly request it. I'll try to verify that this is the case soon, but if anyone affected by this issue would care to try that gem out and give feedback, that'd be fantastic.
I believe @flavorjones fixed this several years ago. I'm going to close the issue. Feel free to re-open if this hasn't fixed it and I'll investigate further. Thanks again for the helpful test case!
Creating a DocumentFragment from a Nokogumbo document appears to result in a slow memory leak. The larger the HTML input, the larger the leak is. Here's a simple repro case (I tested against Nokogumbo 1.3.0):
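The snippet itself was lost when this page was captured; a reconstruction consistent with the surrounding description (the file name and input are placeholders) might look like:

```ruby
require 'nokogumbo'

html = File.read('page.html')  # placeholder: any nontrivial HTML input

100_000.times do
  Nokogiri::HTML5.fragment(html)
end
```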
After 100,000 iterations, this process consumes 87MB on my system.
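The report doesn't say how memory was measured. One way to watch the process's resident set size from inside the loop, assuming Linux and its `/proc` filesystem:

```ruby
require 'nokogumbo'

html = '<div>hello</div>' * 1_000

# Resident set size in kilobytes, read from /proc (Linux-only).
def rss_kb
  File.read('/proc/self/status')[/^VmRSS:\s+(\d+)/, 1].to_i
end

100_000.times do |i|
  Nokogiri::HTML5.fragment(html)
  puts "iteration #{i}: #{rss_kb} KB" if (i % 10_000).zero?
end
```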
The leak isn't specific to the `Nokogiri::HTML5.fragment` method. Creating a fragment manually also leaks, while vanilla Nokogiri doesn't; see the sketches below.
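Both of those snippets were also lost in the capture. A reconstruction of the manual variant, assuming it parsed a document with Nokogumbo and then built the fragment through Nokogiri's public `DocumentFragment` API:

```ruby
require 'nokogumbo'

html = File.read('page.html')  # placeholder input

100_000.times do
  doc = Nokogiri::HTML5(html)
  Nokogiri::HTML::DocumentFragment.new(doc, html)  # also leaks, per the report
end
```

And the vanilla comparison, which the report says does not leak:

```ruby
require 'nokogiri'

html = File.read('page.html')  # placeholder input

100_000.times { Nokogiri::HTML.fragment(html) }  # pure libxml2 path: stays flat
```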
I tried poking around a bit to see if I could spot an obvious cause, but C isn't my forte.