-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PdfBox Processor memory issue #198
Comments
Hi @FrancoisPaquette, When I try that PDF using PdfBoxProcessor in 3.4.1, all appears normal (although this is in standalone mode with Java 9): http://localhost:8182/iiif/2/github-198%2F97521.pdf/full/full/0/default.jpg
|
There is indeed a difference using the standalone mode, it uses less memory and it looks like fewer processes are involved also. If I request the images through the web browser it works fine. What I forgot to tell is that I’m getting the memory issue when I send the requests with a small program made to pre-generate thumbnails. Is there maybe something to consider when sending request to Cantaloupe? Also, out of curiosity, I tried using ImageMagick and it’s not better. For some reason, it looks like the connexion closes when I start getting the response. Thanks |
I wonder if the pre-generation program is failing to close its connections? Especially if other processors are having the same issue, and the same requests work in a web browser. I'm only guessing, but if connections don't get closed, it's possible that stuff doesn't get garbage-collected... |
I tested the pre-generation using Jmeter with the same PDF files and I'm seeing the same issue although it takes longer before it happens. I’m not seeing any connections left open in Jmeter. About the web browser, I wasn't seeing clearly last Friday ; the issue is there also. Sorry about that. Thanks |
I created a few hundred PDFs, about 1.5MB each, and wrote a simple script to request them all in sequence: #!/bin/bash
for i in {1..500}
do
curl http://localhost:8182/iiif/2/page-$i.pdf/full/full/0/default.jpg > /dev/null
done Here's what 10 minutes of jconsole monitoring looks like for me with Cantaloupe 3.4.1, PdfBoxProcessor, JDK 9.0.1, standalone mode, Looks pretty normal to me. I also tried with copies of the PDF you shared with similar results. One thing that could cause different results is if there were a large number parallel requests... but, to fill up a 30GB heap would require a huge number of parallel requests, and/or some huge images. |
Hmmm. I'm not running the script from the server itself but from my computer. I'll give it a try from the server and see what happens, I'll get back to you. Thanks a lot |
Not much difference running it from the server as I was expecting. I was able to crash it in about 45 minutes. If you look at the column 'Err', you'll see the number of errors growing. You can see how much memory was used. That test was done using the version 3.4 in a Tomcat with JDK 8.0.91 We are getting those PDF from many different government agencies so we have not control over them and how they are created. There is probably something weird in some of them causing the issue. If you feel like it, I could you send you a bunch of them? If not, that's totally fine :) Thanks |
Yes, send me some PDFs! Also I'm interested in the script you're using, if it's different enough from my little bash script... |
Just to let know that I'll bet out of the office next week, I'll get back to in 2 weeks. |
Hi, In the zip file you will have 100 PDF, I could send you more if needed. But I think it should be enough if you do not turn on the cache and use them in a loop until to see the issue. I needed about 800 the last time. https://www.dropbox.com/s/f9ehokdakqt842o/PDF.rar?dl=0 Also, you will see the file ‘thumbnail.rb, which is the script that I have made to send the request to the server to generate the thumbnails (it was done using the version 2.4.2 of Ruby). Feel free to modify or improve as you wish but basically you will only have to change the name of my server by yours. I do not know if this script could be useful to other people but I would gladly share it if it is. If you think it could be, you could place it on your Github. It’s up to you. I also have a Jmeter test file that I created to test rapidly your hypothesis about the connection that I could send you. You will also see the file ‘ids.txt’ which contains the IDs of all the PDF. The script will read those IDs and use them to send the request to the server. That’s about it, let me know if you have any question. Thanks! |
Sorry for the delay -- I think there are a couple of things going on here:
Can I share one of these non-loading PDFs with the PDFBox developers? |
No worries, thanks for looking into it. Sure, you can share a PDF with the PDFBOx developers. Let me know the Jira number, I could maybe help out. Regarding the read timeout, I was in fact using it to keep the connection open. Keeping it open for 300 seconds was enough to be able to process most of the PDFs. I didn't look at the value in the library but I think it's around 60 seconds or so. I think I just remove a 0, thus the 30 you are seeing, because I didn't want to keep it open longer anymore, it was to toubleshoot. I may be wrong but I did a few tests with ImageMagick and I believe the problem was the same. For some reason the processor is not working anymore on my test server, I'll looking into it (I had a bit of problem making it work) and test it again. |
I did a few tests to see if your last commit has any positive effect and I'm not seeing any difference, just to let you know. ImageMagick is definitely working a lot better as it indeed stop loading if the connexion closes as you were saying. Some of the PDF are detected as PNG by IM, it probably doesn't help PDFBox. |
Sorry, this fell off my radar. Here is my support request: https://lists.apache.org/thread.html/1f05908dba47cbb2b60c45ec3e433855e073394cabdd85b9c4b6bcac@%3Cusers.pdfbox.apache.org%3E |
No worries. Let me know if you need more PDFs. |
Not sure if it's related but we are seeing issues with PDFs as well. Not nearly at this level but we do see blocking threads so I dug a little deeper. It looks as though PDFBox is calling a synchronized method that is blocking our processing. We have one PDF page that we're trying to render that takes 6 seconds. For 4 of those seconds it's blocked. Here is the thread dump of the block:
The This is with this set in PDFBoxProcessor
I found a StackOverflow article that addresses a very similar issue: I'm still looking but I can see in our thread dump for a single PDF image PDFBox ends up calling that synchronized method 7 different times, several of which become blocked. |
Hi,
I’m having issues with some of our PDF using PdfBox. For reasons that I’m trying to figure out, they are taking quite a lot of memory, some directly from the server itself. This has for effect that after multiple requests to Cantaloupe, I’m getting a ‘out of memory’. Either the OOM killer is triggered by the server or Cantaloupe just crash. The only way to release the memory to the server is to restart the application server.
Here’s an example of a PDF very problematic.
https://www.dropbox.com/s/er4oblu57bigzhk/97521.pdf?dl=0
Looking closer what is going on memory wise, I’m seeing those PDF filled up the young generation. So I added more memory to the server, the JVM can use up to 30 G, which help but didn’t solve the problem. I then started tuning the garbage collector (the G1 seems to be the better choice for the situation), which also helped but the problem is still there.
I was wondering if you have seen a similar problem before?
Thanks
The text was updated successfully, but these errors were encountered: