Feature request: Linearized PDFs #695
Comments
@skairunner I am aware that PDFBox cannot produce linearized PDFs, but to my knowledge it can still read them. Are you sure your linearized PDFs cannot be delivered? Of course, it cannot stream or jump to a specific page, which is a bottleneck. We have users with PDFs of 600+ MB on S3, and Cantaloupe generates derivatives correctly, but it requires the source cache to be enabled. Memory consumption is an issue, but there are ways to reduce it (a pull request might come from me soon) by enabling a PDFBox flag (subsampling).
Yes, to my knowledge PDFBox doesn't have any problems with consuming linearized PDFs. In our tests, if Cantaloupe doesn't run out of memory it does deliver the tiles eventually. If the PDF file is not in the filesystem cache, Cantaloupe has to download the entire large PDF file before generating tiles, which takes a while. The IIIF viewer we are using (Universal Viewer) requests several pages at once, which seems to make Cantaloupe request the same source multiple times and also open it multiple times in memory, which can kill it. The ideal outcome is having a processor in Cantaloupe that can take advantage of linearized PDFs and deliver tiles for random pages with low latency and memory usage. I definitely didn't intend this feature to be implemented in a speedy manner though. It seems like a large amount of work, after all, and for a fairly niche audience 😓 But having tracking issues is always a good thing and might help someone else down the road.
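One way to picture the "same source requested multiple times" problem is as a missing request-coalescing step. The sketch below is a generic illustration in plain Java, not Cantaloupe's actual code: the `SingleFlight` class and its `get` method are hypothetical names, and the loader stands in for whatever downloads or opens the source PDF. While one load for a key is in progress, other callers asking for the same key share its result instead of starting their own.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical helper: coalesces concurrent loads of the same key. */
class SingleFlight<K, V> {

    // Loads currently in progress, keyed by source identifier.
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    /**
     * Returns a future for the value of {@code key}. If another thread is
     * already loading that key, its future is returned and {@code loader}
     * is not called again. Otherwise the loader runs on the calling thread.
     */
    CompletableFuture<V> get(K key, Function<K, V> loader) {
        CompletableFuture<V> fresh = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing; // someone else is already fetching this source
        }
        try {
            fresh.complete(loader.apply(key));
        } catch (Throwable t) {
            fresh.completeExceptionally(t);
        } finally {
            // Drop the entry so a later request triggers a new load;
            // keeping results around is a cache's job, not this helper's.
            inFlight.remove(key, fresh);
        }
        return fresh;
    }
}
```

With something like this in front of the download step, several simultaneous page requests for one PDF would wait on a single fetch. It does nothing for memory used while rendering, so it addresses only part of what is described above.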
hi @skairunner totally. IABookreader does the same as Universal Viewer which really does not help much. The |
Currently Cantaloupe uses PDFBox, which does not support linearized PDFs. Whenever Cantaloupe needs to render tiles for a PDF file, it has to download the entire file before it can render any individual page. As seen in #198 and #557, this is quite slow and impractical when PDFs get large. In our own testing with the S3 backend, the cutoff point seems to be around 100 MB.
It would be nice if Cantaloupe provided a new processor that supports linearized PDFs and, when paired with a storage backend that supports random access, downloads only the page(s) it needs to render tiles. Unfortunately, PDFBox does not support this, so an alternative PDF library would have to be used. From quick research, there seem to be few pure Java libraries with this functionality. The C++ library qpdf supports linearized reading, though. Using it would complicate the build process, and the processor might have to be distributed as an optional add-on.
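To make the idea of "download only what is needed" a bit more concrete, here is a minimal sketch in plain Java. It relies on one property of the PDF format: a linearized file carries a linearization parameter dictionary within its first 1024 bytes, containing `/Linearized` and, among other keys, `/E` (the byte offset of the end of the first page). Everything else is an assumption made for illustration: `LinearizedProbe`, `RangeSource`, and `fromFile` are hypothetical names, a local file stands in for a storage backend with ranged reads, and the regex lookup is a shortcut rather than a real PDF parser.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinearizedProbe {

    /** Hypothetical abstraction over a backend that can serve byte ranges. */
    interface RangeSource {
        byte[] read(long offset, int length) throws IOException;
    }

    /** Local-file stand-in for a remote store that supports ranged reads. */
    static RangeSource fromFile(Path path) {
        return (offset, length) -> {
            try (SeekableByteChannel ch =
                         Files.newByteChannel(path, StandardOpenOption.READ)) {
                ch.position(offset);
                ByteBuffer buf = ByteBuffer.allocate(length);
                // Read until the buffer is full or the file ends.
                while (buf.hasRemaining() && ch.read(buf) != -1) { }
                byte[] out = new byte[buf.position()];
                buf.flip();
                buf.get(out);
                return out;
            }
        };
    }

    // Simplification: a real reader would parse the dictionary properly.
    private static final Pattern FIRST_PAGE_END = Pattern.compile("/E\\s+(\\d+)");

    /** Fetches only the first 1024 bytes and looks for the /Linearized key. */
    static boolean isLinearized(RangeSource source) throws IOException {
        String head = new String(source.read(0, 1024), StandardCharsets.ISO_8859_1);
        return head.contains("/Linearized");
    }

    /** Returns the /E offset if present, i.e. how many bytes cover page one. */
    static OptionalLong firstPageEnd(RangeSource source) throws IOException {
        String head = new String(source.read(0, 1024), StandardCharsets.ISO_8859_1);
        Matcher m = FIRST_PAGE_END.matcher(head);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1)))
                        : OptionalLong.empty();
    }
}
```

A processor built along these lines could fetch the first kilobyte, learn from `/E` how much of the file is needed for the first page, and request only that range. Reaching arbitrary later pages would additionally require reading the hint tables, which is where a library with real linearization support, such as qpdf, would come in.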