pdfocr_save Pixmap_pdfocr_save Causes Hard Crash #1965
Comments
Thanks for the bug report and for providing the reproducer. I've reproduced the bug with latest PyMuPDF and MuPDF; it looks like a divide-by-zero inside MuPDF, caused by This will be fixed in the next release of PyMuPDF. Separately, I notice that you are using PyMuPDF-1.19.3 and Python-3.6. Note that current and future PyMuPDF releases require Python-3.7 or later. |
Apart from that future fix, it can easily be prevented by checking meaningful rectangles before trying to make pixmaps. |
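A minimal sketch of what such a pre-check could look like. This is not the maintainer's exact check (which is not shown in the thread). It assumes rectangles arrive as plain (x0, y0, x1, y1) tuples in points, as in the bbox part of a word or block tuple; the name is_meaningful_rect and the dpi parameter are hypothetical. With PyMuPDF Rect objects, the Rect.is_empty property covers the degenerate case in a similar way.

```python
import math


def is_meaningful_rect(rect, page_rect, dpi=72):
    """Return True if rect, clipped to the page, covers at least one pixel.

    rect and page_rect are (x0, y0, x1, y1) tuples in points (1/72 inch).
    Hypothetical helper for illustration, not part of PyMuPDF.
    """
    x0, y0, x1, y1 = rect
    # Reject NaN or infinite coordinates outright.
    if not all(math.isfinite(v) for v in (x0, y0, x1, y1)):
        return False
    # Clip the rectangle to the page bounds.
    px0, py0, px1, py1 = page_rect
    cx0, cy0 = max(x0, px0), max(y0, py0)
    cx1, cy1 = min(x1, px1), min(y1, py1)
    # Empty or inverted after clipping: nothing to render.
    if cx1 <= cx0 or cy1 <= cy0:
        return False
    # Require at least one whole pixel in each direction at this resolution.
    scale = dpi / 72
    return int((cx1 - cx0) * scale) >= 1 and int((cy1 - cy0) * scale) >= 1
```

A caller would skip any block for which this returns False instead of building a pixmap from it. Whether this catches every rectangle that triggers the divide-by-zero is not confirmed by the thread.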
Thank you Jorj! I love your work! |
Thank you, but remember: It's not mine (alone) any longer! |
Checking for meaningful rectangles before trying to make pixmaps didn't seem to prevent the crash. Or maybe this is not what you meant by checking for meaningful rectangles?
|
Actually it is what I meant - although a check like |
Okay, updates made. Here is the environment:
PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Attempted check for meaningful rectangles:
...still crashed out |
Were you able to replicate the crash even after the updates? Were you able to skip the processing of the bad rectangle and avoid the crash? |
Yes, the fix intercepts when an invalid pixmap would be built and reacts with an exception. |
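Once an invalid pixmap raises an exception instead of crashing the interpreter, the caller can skip the offending block and continue. The sketch below illustrates that pattern under assumptions: ocr_blocks and the ocr_one callable are hypothetical names, and because the thread does not say which exception type the fix raises, the code catches Exception broadly.

```python
def ocr_blocks(blocks, ocr_one):
    """Apply ocr_one to each block, skipping blocks that raise.

    blocks is any iterable of items; ocr_one is a caller-supplied function
    (hypothetical here) that builds the pixmap and returns the OCR text.
    Returns (results, skipped) as lists of (index, value) pairs.
    """
    results, skipped = [], []
    for index, block in enumerate(blocks):
        try:
            results.append((index, ocr_one(block)))
        except Exception as exc:
            # Record the failure and move on to the next block.
            skipped.append((index, repr(exc)))
    return results, skipped
```

This only helps with versions that raise; a hard crash inside the C library cannot be caught by try/except.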
Fixed in 1.21.0 |
Describe the bug (mandatory)
Python crashes hard without any error or traceback output.
To Reproduce (mandatory)
Expected behavior (optional)
Expected to work, or at least provide an error output.
Your configuration (mandatory)
Request for support
It would be great if I could just ignore stuff that is going to cause this crash... miss extracting the text rather than crashing. Following this, an error message and traceback rather than a hard crash would be great.
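As a general workaround for hard crashes in native code (not something proposed in this thread), the risky step can run in a child interpreter with Python's standard faulthandler enabled. A crash then ends only the child, and the parent can log the failure and skip that input. The function name run_isolated is hypothetical, and the sketch assumes the work can be expressed as a standalone snippet of Python source.

```python
import subprocess
import sys


def run_isolated(code, timeout=60):
    """Run Python source in a child interpreter with faulthandler enabled.

    Returns (returncode, stdout, stderr). A non-zero return code means the
    child failed or crashed; the parent process keeps running either way.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

For example, run_isolated("import os; os.abort()") returns a non-zero code instead of taking down the caller, and stderr holds whatever traceback faulthandler was able to dump.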
Additional context (optional)
Attempted to Apply to my use case: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract2.py
While I tried to provide code to create a PDF that (seems to) fail in the exact same way, I have other PDFs that I cannot share that encounter the same failure. In these other PDFs, I do not actually write text, I am simply trying to extract the text. Another difference is that these other PDFs don't seem to print any word_block words, and the kernel just crashes straight out.
The issue seems to have something to do with the unknown Unicode and maybe text not being in page bounds?
I tried to call page.clean_contents() before processing the page, but it didn't help.
I traced my actual use case and got this just before the hard crash. My use case module is called "PDFWorks.py"
Maybe Related?
#1733
#1738