pdfocr_save Pixmap_pdfocr_save Causes Hard Crash #1965

Closed
TheCapybaraClub opened this issue Oct 13, 2022 · 10 comments
Labels
bug resolved fixed / implemented / answered

@TheCapybaraClub

Describe the bug (mandatory)

Python crashes hard without any error or traceback output.

To Reproduce (mandatory)

import sys
import fitz

def get_tessocr(page, bbox):
    """Return OCR-ed span text using Tesseract.
    """
    mat = fitz.Matrix(5, 5)  # high resolution matrix

    # Step 1: Make a high-resolution image of the bbox.
    pix = page.get_pixmap(
        matrix=mat,
        clip=bbox,
    )
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    text = ocrpage.get_text()
    if text.endswith("\n"):
        text = text[:-1]
    return text


def gen_pdf(pdf_outpath='./output.pdf'):
    TextNorm_style = dict(fontname="hebo", fontsize=10)

    doc = fitz.open()
    page = doc.new_page(pno = -1,# insertion point: end of document
                    width = 595, # page dimension: A4 portrait
                    height = 842)
        
    bad = r'./temp/tmp8g7_xpcg\noequal\UTSW-100AT.msg'
    where_y = 100
    where_x = 75
    for fn, fp in enumerate(range(2000)):
        y = where_y+( (fn+1)*20 )
        where = fitz.Point(where_x, y)
        page.insert_text(where, f"- {fp} - {bad}", **TextNorm_style)
    
    doc.save(pdf_outpath)
    doc.close()
    print(f"Done. Saved: {pdf_outpath}")

output_path = './data/Interesting/gen_output.pdf'
gen_pdf(pdf_outpath=output_path)

pdfDoc = fitz.open(output_path)
print(f"Number of Pages: {len(pdfDoc)}")

for pn, page in enumerate(pdfDoc):
    print(f"page {pn}")
    textpage = page.get_textpage(flags=3)
    page_text_words = page.get_text("words", textpage=textpage)
    
    for wn, wb in enumerate(page_text_words):
        if chr(65533) in wb[4]:
            x0, y0, x1, y1, words, block_no, line_no, word_no = wb
            print(words)
            
            #this will cause a hard crash
            new_words = get_tessocr(page, [x0, y0, x1, y1])
            
pdfDoc.close()

Expected behavior (optional)

Expected to work, or at least provide an error output.

Your configuration (mandatory)

3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] 
 win32 
 
PyMuPDF 1.19.3: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-12-12 06:51:56.
Built for Python 3.6 on win32 (64-bit).

Request for support

It would be great if I could just skip whatever is going to cause this crash, and miss extracting that text rather than crashing. Failing that, an error message and traceback rather than a hard crash would be great.

Additional context (optional)

I attempted to apply this script to my use case: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract2.py

While I tried to provide code to create a PDF that (seems to) fail in the exact same way, I have other PDFs that I cannot share that encounter the same failure. In these other PDFs, I do not actually write text, I am simply trying to extract the text. Another difference is that these other PDFs don't seem to print any word_block words, and the kernel just crashes straight out.

The issue seems to have something to do with the unknown Unicode and maybe text not being in page bounds?

I tried to call page.clean_contents() before processing the page, but it didn't help.

I traced my actual use case and got this just before the hard crash. My use case module is called "PDFWorks.py"

PDFWorks.py(210):         ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
 --- modulename: fitz, funcname: pdfocr_tobytes
fitz.py(6886):         EnsureOwnership(self)
 --- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805):     if getattr(o, "this", None) and not o.this.own():
fitz.py(6887):         from io import BytesIO
 --- modulename: _bootstrap, funcname: _handle_fromlist
<frozen importlib._bootstrap>(1007): <frozen importlib._bootstrap>(1032): fitz.py(6889):         bio = BytesIO()
fitz.py(6890):         self.pdfocr_save(bio, compress=compress, language=language)
 --- modulename: fitz, funcname: pdfocr_save
fitz.py(6870):         EnsureOwnership(self)
 --- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805):     if getattr(o, "this", None) and not o.this.own():
fitz.py(6872):         return _fitz.Pixmap_pdfocr_save(self, filename, compress, language)

Maybe Related?
#1733
#1738

@julian-smith-artifex-com
Collaborator

Thanks for the bug report and for providing the reproducer. I've reproduced the bug with latest PyMuPDF and MuPDF; it looks like a divide-by-zero inside MuPDF, caused by get_tessocr() calling pix.pdfocr_tobytes() with pix being a pixmap with zero height.

This will be fixed in the next release of PyMuPDF.
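Until a fixed release is installed, the zero-height case described above could be avoided by inspecting the pixmap before handing it to OCR. The following is only a sketch: `pixmap_has_pixels` and `get_tessocr_checked` are hypothetical names, not PyMuPDF functions, and the sketch assumes that `get_pixmap()` itself returns normally for such a clip, as the trace in the report suggests.

```python
def pixmap_has_pixels(pix):
    """Return True if the pixmap has at least one row and one column."""
    return pix.width > 0 and pix.height > 0


def get_tessocr_checked(page, bbox, zoom=5):
    """Variant of the reporter's get_tessocr() that skips empty pixmaps.

    Returns the OCR-ed text, or None if the rendered clip has no pixels.
    """
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependencies

    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=bbox)
    if not pixmap_has_pixels(pix):
        return None  # nothing to OCR; skip instead of calling pdfocr_tobytes()
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    text = ocrpdf[0].get_text()
    ocrpdf.close()
    # Drop a single trailing newline, as in the original reproducer.
    return text[:-1] if text.endswith("\n") else text
```

A caller would treat a `None` result as "no text for this word" and move on to the next bounding box.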

Separately, I notice that you are using PyMuPDF 1.19.3 and Python 3.6. Note that current and future PyMuPDF releases require Python 3.7 or later.

@JorjMcKie
Collaborator

JorjMcKie commented Oct 14, 2022

Apart from that future fix, it can easily be prevented by checking meaningful rectangles before trying to make pixmaps.
I am not clear though how an extracted word can have x0 = x1 or y0 = y1. I remember an error of that sort in some earlier release, but have no version number for this at hand.

@TheCapybaraClub
Author

Thank you Jorj! I love your work!

@JorjMcKie
Collaborator

> Thank you Jorj! I love your work!

Thank you, but remember: It's not mine (alone) any longer!
PyMuPDF's quality, soundness and thus its secured future is now in the powerful hands of Artifex' team of experts.

@TheCapybaraClub
Author

Checking for meaningful rectangles before trying to make pixmaps didn't seem to prevent the crash. Or is this not what you meant by checking for meaningful rectangles?

fRect = fitz.Rect([x0, y0, x1, y1])
if fRect.is_valid and not fRect.is_empty and not fRect.is_infinite:
    new_words = get_tessocr(page, [x0, y0, x1, y1])

@JorjMcKie
Collaborator

> checking meaningful rectangles before trying to make pixmaps didn't seem to prevent the crash. Or maybe this is not what you mean by checking for meaningful rectangles?
>
> fRect = fitz.Rect([x0, y0, x1, y1])
> if fRect.is_valid and not fRect.is_empty and not fRect.is_infinite:
>     new_words = get_tessocr(page, [x0, y0, x1, y1])

Actually it is what I meant - although a check like y1-y0 > 1e-5 and x1-x0 > 1e-5 would have been sufficient. Did you upgrade to the most recent versions of Python and PyMuPDF?

@TheCapybaraClub
Author

Okay, updates made. Here is the environment:

! python -V
Python 3.10.8

!jupyter --version
Selected Jupyter core packages...
IPython : 8.5.0
ipykernel : 6.16.0
ipywidgets : 8.0.2
jupyter_client : 7.4.2
jupyter_core : 4.11.1
jupyter_server : 1.21.0
jupyterlab : not installed
nbclient : 0.7.0
nbconvert : 7.2.1
nbformat : 5.7.0
notebook : 6.5.1
qtconsole : 5.3.2
traitlets : 5.4.0

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)]
win32

PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.10 on win32 (64-bit).

My attempt at checking for meaningful rectangles:

            if fRect.is_valid and not fRect.is_empty and not fRect.is_infinite and y1-y0 > 1e-5 and x1-x0 > 1e-5:
                new_words = get_tessocr(page, [x0, y0, x1, y1])

...it still crashed out.
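One possible explanation, which the thread does not confirm, is that the check above works in page units while the pixmap is sized in whole pixels: a rectangle slightly taller than 1e-5 points can still cover less than one pixel after the 5x zoom used in `get_tessocr()`. Under that assumption, a more conservative pre-check could compare the scaled size against a pixel threshold. The name `clip_renders_pixels` and the one-pixel threshold are illustrative choices, not part of PyMuPDF.

```python
def clip_renders_pixels(bbox, zoom=5.0, min_pixels=1.0):
    """Heuristic: True if bbox scaled by zoom spans at least min_pixels
    in both directions.

    bbox is (x0, y0, x1, y1) in page units. The threshold is a conservative
    guess; it does not reproduce MuPDF's own rounding of the clip rectangle.
    """
    x0, y0, x1, y1 = bbox
    scaled_width = (x1 - x0) * zoom
    scaled_height = (y1 - y0) * zoom
    return scaled_width >= min_pixels and scaled_height >= min_pixels
```

In the loop from the reproducer, `get_tessocr(page, [x0, y0, x1, y1])` would then only be called when `clip_renders_pixels((x0, y0, x1, y1))` is true. Because the threshold is conservative, some very thin but renderable rectangles would be skipped as well.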

@TheCapybaraClub
Author

Were you able to replicate the crash even after the updates? Were you able to skip processing the bad rectangle and avoid the crash?

@JorjMcKie
Collaborator

> Were you able to replicate the crash even after the updates? Were you able to skip processing the bad rectangle and avoid the crash?

Yes, the fix intercepts when an invalid pixmap would be built and reacts with an exception.
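With a release that raises an exception instead of crashing, the behaviour requested in the report (skip the problematic word and carry on) could be sketched as below. The thread does not say which exception type is raised, so the sketch catches `Exception` broadly; `ocr_words_skipping_failures` and `ocr_func` are hypothetical names. This cannot help on versions where the interpreter crashes hard, because no Python exception is raised there.

```python
def ocr_words_skipping_failures(words, ocr_func):
    """Run ocr_func on each word's bounding box, skipping any that fail.

    words: tuples as returned by page.get_text("words"); the first four
           items are x0, y0, x1, y1.
    ocr_func: callable taking a bbox tuple and returning the OCR-ed text.
    Returns a list of (bbox, text) pairs for the boxes that succeeded.
    """
    results = []
    for word in words:
        bbox = tuple(word[:4])
        try:
            text = ocr_func(bbox)
        except Exception as exc:  # exact exception type is not stated in the thread
            print(f"Skipping {bbox}: {exc}")
            continue
        results.append((bbox, text))
    return results
```

In the reproducer, `ocr_func` could be `lambda bbox: get_tessocr(page, bbox)`, so that a rejected rectangle costs one missed word rather than the whole run.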

@julian-smith-artifex-com added the "resolved" (fixed / implemented / answered) label and removed the "Fixed in next release" label on Nov 8, 2022
@julian-smith-artifex-com
Collaborator

Fixed in 1.21.0
