-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make selenium-generated PDF readable #321
Conversation
Try to find “%%EOF” in last 1Mb of file.
@akolpakov thank you! This fixed the problem I was having. |
Can this get merged? I'm seeing this right now in the most recent pip release |
Please merge this PR. We ran into the same issue. |
thank you @akolpakov ! |
Maybe this will help someone else -- but I ran into this issue when an xls file was included in my PDF list of files. After removing that xls file from the group to be merged, the issue no longer happens in my environment. Check all your |
This is a great change and should be merged in as-is. It attempts to address these issues, which I believe are duplicates of each other: #177 #442 #480 Is this change safe?I'm not very familiar with the PDF standard, but from what I've read this change is safe. I've looked through a PDF specification listed on Adobe's website (https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf which is linked to from https://www.adobe.com/devnet/pdf/pdf_reference.html with the description "This document is an ISO approved copy of the ISO 32000-1 Standards document" and https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf) for references of %%EOF and it doesn't appear that there is anything meaningful after %%EOF. I've heard that some PDF generators might use this to store comments or other metadata. Personally I've seen a PDF that had a bunch of null bytes after %%EOF and I have no idea why. The second document that I mentioned above has this interesting note, "Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file." That document is dated 2006. I didn't attempt to ascertain whether that is currently the case. I looked at two other open source PDF libraries: Poppler previously looked for %%EOF in the last 1024 bytes ("we look in the last 1024 chars because Adobe does the same"), however that check was disabled in 2007 (https://gitlab.freedesktop.org/poppler/poppler/-/commit/be1b5a0196cdfc78f74e08a023b477cac16eb0f3) with the comment "Adobe does not seem to enforce %%EOF, so we do the same." PDFBox is considerably more complicated. The "1024" amount appears to be configurable. It also looks like there's a "lenient" mode where %%EOF is not required. Are there other changes that should be made?Yeah, probably. Why limit to 1MB? Why not look in the entire file? Why require it at all? If it's not there then look for startxref starting at the end of the file (I believe that's what Poppler does). Also it's definitely a good idea to add a unit test for this and it should be trivial. Also the test coverage from the existing tests (https://github.com/mstamy2/PyPDF2/blob/master/Tests/tests.py) is worryingly small. Other notesThe looping behavior was changed in PR #75. It's possible that change is slightly wrong and caused this problem... but I haven't tried to look at this code close enough to be convinced of that. This GitHub issue is related, and could possibly be resolved by refactoring this loop: #361 |
@markdoliner Thank you for the summary! |
Does anybody have a small PDF created by selenium that has this issue? I would like to add it to the test suite. |
Hi, @MartinThoma. Sorry I don't think I can share the PDF where I saw this problem since it contains private information :-( |
Are you maybe able to generate a minimal example with another website? If all selenium-generated PDFs are affected, this should be rather easy. You could https://pypi.org/project/PyPDF2/ |
I added a comment there with a link to a test PDF, and instruction to generate this kind of test PDF. |
Bug Fixes (BUG): - Use 1MB as offset for readNextEndLine (#321) - 'PdfFileWriter' object has no attribute 'stream' (#787) Robustness (ROB): - Invalid float object; use 0 as fallback (#782) Documentation (DOC): - Robustness (#785) Full Changelog: 1.27.7...1.27.8
Try to find “%%EOF” in last 1Mb of file. This fixes the issue with reading Selenium-generated PDF files. Closes py-pdf#177 Closes py-pdf#442 Closes py-pdf#480
Bug Fixes (BUG): - Use 1MB as offset for readNextEndLine (py-pdf#321) - 'PdfFileWriter' object has no attribute 'stream' (py-pdf#787) Robustness (ROB): - Invalid float object; use 0 as fallback (py-pdf#782) Documentation (DOC): - Robustness (py-pdf#785) Full Changelog: py-pdf/pypdf@1.27.7...1.27.8
Try to find “%%EOF” in last 1Mb of file.
In some PDF files, “%%EOF” sign can be far away from the end of document