Feature request: Linearized PDFs #695

Open
skairunner opened this issue Aug 22, 2024 · 3 comments

Comments

@skairunner

Currently Cantaloupe uses PDFBox, which does not support linearized PDFs. So whenever Cantaloupe wants to render tiles for a PDF file, it has to download the entire file before it can render any individual page. As seen in #198 and #557, this is quite slow and impractical when PDFs get large. In our own testing with the S3 backend, the cutoff point seems to be around 100 MB.

It would be nice if Cantaloupe provided a new processor that supports linearized PDFs and, when paired with a storage backend that supports random access, downloads only the page(s) it needs to render tiles. Unfortunately, PDFBox does not support this, so an alternative PDF library would have to be used. From quick research, there do not seem to be many pure Java libraries that support this functionality. The C++ library qpdf supports linearized reading, though. Using it would complicate the build process, and the processor might have to be distributed as an optional add-on.
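
To make "download only the page(s) it needs" concrete, here is a minimal sketch of a ranged read over HTTP, assuming the backend honors the standard `Range` header (S3 does for object reads). The class `RangedRead` and its method are hypothetical names for illustration and are not part of Cantaloupe or PDFBox.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for illustration; not part of Cantaloupe. */
final class RangedRead {
    /** Builds a GET request for the inclusive byte range [first, last]. */
    static HttpRequest rangeRequest(URI source, long first, long last) {
        if (first < 0 || last < first) {
            throw new IllegalArgumentException("invalid range: " + first + "-" + last);
        }
        return HttpRequest.newBuilder(source)
                // Standard HTTP range syntax: both offsets are inclusive.
                .header("Range", "bytes=" + first + "-" + last)
                .GET()
                .build();
    }
}
```

Sending the request with `java.net.http.HttpClient` should yield a `206 Partial Content` response from a server that supports ranges. A processor would still need a PDF library that knows which byte ranges hold a given page, which is the part PDFBox does not provide.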

@DiegoPino
Contributor

@skairunner I am aware that PDFBox cannot produce linearized PDFs, but (to my knowledge) it can still read them. Are you sure your linearized PDFs cannot be delivered? Of course, it cannot stream or jump to a specific page, which is a bottleneck. We have users with PDFs of 600+ MB on S3, and Cantaloupe is able to generate derivatives correctly, but it requires the source cache to be enabled. Memory consumption is an issue, but there are ways to reduce it (a pull request might come soon from me), such as enabling PDFBox's subsampling flag, PDFRenderer.setSubsamplingAllowed(true), which might help. qpdf might be a solution, but it would likely require building a completely new processor. More importantly, the approach to new processor architecture would need to be discussed with the development team, since there has been a trend to move away from external binaries for handling derivatives (e.g. the ImageMagick processor was removed).

@skairunner
Author

Yes, to my knowledge PDFBox doesn't have any problems with consuming linearized PDFs. In our tests, if Cantaloupe doesn't run out of memory, it does deliver the tiles eventually. If the PDF file is not in the filesystem cache, Cantaloupe has to download the entire large PDF file before generating tiles, which takes a while. The IIIF viewer we are using (Universal Viewer) requests several pages at once, which seems to make Cantaloupe request the same source multiple times and also open it in memory multiple times, which can kill it.
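
One common way to keep several simultaneous page requests from fetching the same source more than once is to coalesce them behind a single in-flight load. The sketch below illustrates that idea only; `SingleFlight` is a hypothetical name, and nothing here describes how Cantaloupe actually schedules its downloads.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Cantaloupe. */
final class SingleFlight<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** Runs the loader at most once per key while a load for that key is pending. */
    CompletableFuture<V> load(K key, Function<K, V> loader) {
        CompletableFuture<V> created = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, created);
        if (existing != null) {
            return existing; // another caller is already fetching this key
        }
        try {
            // The first caller does the work on its own thread.
            created.complete(loader.apply(key));
        } catch (Throwable t) {
            created.completeExceptionally(t);
        } finally {
            // Forget the entry so a later request can trigger a fresh load.
            inFlight.remove(key, created);
        }
        return created;
    }
}
```

Callers that arrive while a download is pending receive the same future and wait on it, so the source is fetched once. This does not address opening the file in memory several times; that would need a shared, reference-counted document handle or similar.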

The ideal outcome is having a processor in Cantaloupe that can take advantage of linearized PDFs and deliver tiles for random pages with low latency and memory usage.
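
Before such a processor could take a fast path, it would need to tell whether a source is linearized at all. The PDF specification requires the linearization parameter dictionary to lie within the first 1024 bytes of the file, so a cheap heuristic check is possible. This is a rough sketch with hypothetical names (`LinearizedPdfCheck`, `looksLinearized`), not an existing Cantaloupe or PDFBox API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of Cantaloupe or PDFBox. */
final class LinearizedPdfCheck {
    // The linearization parameter dictionary must sit in the first 1024 bytes.
    private static final int HEADER_WINDOW = 1024;

    /** Returns true if the given leading bytes look like a linearized PDF. */
    static boolean looksLinearized(byte[] head) {
        int len = Math.min(head.length, HEADER_WINDOW);
        // ISO-8859-1 maps each byte to one char, so binary content is safe here.
        String text = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        return text.startsWith("%PDF-") && text.contains("/Linearized");
    }

    /** Reads only the first 1024 bytes of a local file and checks them. */
    static boolean looksLinearized(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLinearized(in.readNBytes(HEADER_WINDOW));
        }
    }
}
```

This is only a heuristic: a file that was linearized and later updated incrementally still carries the `/Linearized` key even though its layout is no longer valid for page-at-a-time access, so a real implementation would also need to validate the declared file length.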

I definitely didn't intend this feature to be implemented in a speedy manner though. It seems like a large amount of work, after all, and for a fairly niche audience 😓 But having tracking issues is always a good thing and might help someone else down the road.

@DiegoPino
Contributor

Hi @skairunner, totally. IABookreader does the same as Universal Viewer, which really does not help much. Having to download the whole large file is also an issue for other processors that cannot stream, e.g. the OpenJPEG one for JPEG 2000, and more generally it is an issue with remote storage (the same goes for a 1 GB+ non-pyramidal TIFF). I will share the use case/need in next week's call. Thanks!
