pdfocr_save Pixmap_pdfocr_save Causes Hard Crash #1965

Closed
@TheCapybaraClub

Describe the bug (mandatory)

Python crashes hard without any error or traceback output.

To Reproduce (mandatory)

import sys
import fitz

def get_tessocr(page, bbox):
    """Return OCR-ed span text using Tesseract.
    """
    mat = fitz.Matrix(5, 5)  # high resolution matrix

    # Step 1: Make a high-resolution image of the bbox.
    pix = page.get_pixmap(
        matrix=mat,
        clip=bbox,
    )
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    text = ocrpage.get_text()
    if text.endswith("\n"):
        text = text[:-1]
    return text


def gen_pdf(pdf_outpath='./output.pdf'):
    TextNorm_style = dict(fontname="hebo", fontsize=10)

    doc = fitz.open()
    page = doc.new_page(
        pno=-1,      # insertion point: end of document
        width=595,   # page dimension: A4 portrait
        height=842,
    )
        
    bad = r'./temp/tmp8g7_xpcg\noequal\UTSW-100AT.msg'
    where_y = 100
    where_x = 75
    for fn in range(2000):
        # Lines are 20 pt apart, so most of them land below the 842 pt page height.
        y = where_y + (fn + 1) * 20
        where = fitz.Point(where_x, y)
        page.insert_text(where, f"- {fn} - {bad}", **TextNorm_style)
    
    doc.save(pdf_outpath)
    doc.close()
    print(f"Done. Saved: {pdf_outpath}")

output_path = './data/Interesting/gen_output.pdf'
gen_pdf(pdf_outpath=output_path)

pdfDoc = fitz.open(output_path)
print(f"Number of Pages: {len(pdfDoc)}")

for pn, page in enumerate(pdfDoc):
    print(f"page {pn}")
    textpage = page.get_textpage(flags=3)
    page_text_words = page.get_text("words", textpage=textpage)
    
    for wb in page_text_words:
        if chr(65533) in wb[4]:  # word contains U+FFFD, the Unicode replacement character
            x0, y0, x1, y1, words, block_no, line_no, word_no = wb
            print(words)

            # This call causes the hard crash.
            new_words = get_tessocr(page, [x0, y0, x1, y1])
            
pdfDoc.close()

Expected behavior (optional)

I expected this to work, or at least to produce an error message.

Your configuration (mandatory)

3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] 
 win32 
 
PyMuPDF 1.19.3: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-12-12 06:51:56.
Built for Python 3.6 on win32 (64-bit).

Request for support

It would be great if I could simply skip whatever is going to cause this crash, that is, miss extracting that text rather than crash. Next best would be an error message and traceback rather than a hard crash.
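
One way to get "skip instead of crash" behavior from the calling side is to run the OCR step in a separate interpreter, so that a hard crash only kills the child process. The following is a minimal sketch of that idea, not a fix for the underlying bug. The names `OCR_WORKER` and `run_isolated` are hypothetical, and the worker script assumes PyMuPDF with Tesseract support is importable in the child interpreter.

```python
import json
import subprocess
import sys

# Hypothetical child-side script. It assumes PyMuPDF (fitz) with Tesseract
# support is importable in the child interpreter.
OCR_WORKER = r"""
import json, sys
import fitz

pdf_path, page_no, bbox = json.loads(sys.argv[1])
doc = fitz.open(pdf_path)
pix = doc[page_no].get_pixmap(matrix=fitz.Matrix(5, 5), clip=fitz.Rect(bbox))
ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
sys.stdout.write(json.dumps(ocrpdf[0].get_text()))
"""


def run_isolated(script, payload, timeout=120):
    """Run `script` in a child interpreter and return its JSON result.

    Returns None if the child crashes, exits non-zero, times out, or
    prints something that is not valid JSON.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, json.dumps(payload)],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:  # negative on POSIX if killed by a signal
        return None
    try:
        return json.loads(proc.stdout)
    except ValueError:
        return None
```

In the reproduction script, the call to `get_tessocr` could then become `run_isolated(OCR_WORKER, [output_path, pn, [x0, y0, x1, y1]])`, treating a `None` result as "text skipped". The cost is a new interpreter and a fresh `fitz.open` per word, so this is slow and only makes sense as a stopgap.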

Additional context (optional)

I was attempting to apply this script to my use case: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract2.py

Although the code above generates a PDF that (seems to) fail in exactly the same way, I have other PDFs that I cannot share that hit the same failure. With those PDFs I do not write any text; I am only trying to extract it. Another difference is that those PDFs don't seem to print any word_block words; the kernel just crashes outright.

The issue seems to be related to the unknown Unicode character (chr(65533), the U+FFFD replacement character) and possibly to text lying outside the page bounds.
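
If the out-of-bounds hypothesis is right (it is unconfirmed), one workaround would be to intersect each word box with the page rectangle before asking for a pixmap, and to skip boxes that have no area on the page. The sketch below uses plain tuples so it runs without PyMuPDF; `clip_to_page` is a hypothetical helper name.

```python
def clip_to_page(bbox, page_rect):
    """Intersect a word box with the page rectangle.

    Both arguments are (x0, y0, x1, y1) sequences. Returns the
    intersection as a tuple, or None if the box has no area on the page.
    """
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    ix0, iy0 = max(x0, px0), max(y0, py0)
    ix1, iy1 = min(x1, px1), min(y1, py1)
    if ix0 >= ix1 or iy0 >= iy1:
        return None  # empty or fully outside the page
    return (ix0, iy0, ix1, iy1)
```

In the reproduction loop this could be used as `clip = clip_to_page((x0, y0, x1, y1), tuple(page.rect))`, calling `get_tessocr` only when `clip` is not `None`. In the generated test PDF the page is 842 pt high and lines are written every 20 pt starting at y = 120, so most of the 2000 lines would be skipped by this check.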

I tried to call page.clean_contents() before processing the page, but it didn't help.

I traced my actual use case and got the following output just before the hard crash. My use-case module is called "PDFWorks.py".

PDFWorks.py(210):         ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
 --- modulename: fitz, funcname: pdfocr_tobytes
fitz.py(6886):         EnsureOwnership(self)
 --- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805):     if getattr(o, "this", None) and not o.this.own():
fitz.py(6887):         from io import BytesIO
 --- modulename: _bootstrap, funcname: _handle_fromlist
<frozen importlib._bootstrap>(1007):
<frozen importlib._bootstrap>(1032):
fitz.py(6889):         bio = BytesIO()
fitz.py(6890):         self.pdfocr_save(bio, compress=compress, language=language)
 --- modulename: fitz, funcname: pdfocr_save
fitz.py(6870):         EnsureOwnership(self)
 --- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805):     if getattr(o, "this", None) and not o.this.own():
fitz.py(6872):         return _fitz.Pixmap_pdfocr_save(self, filename, compress, language)
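
The trace ends at the call into the compiled `_fitz` extension, which suggests the crash happens in native code. For crashes of that kind, the standard-library `faulthandler` module can sometimes write a Python-level traceback before the process dies. The sketch below assumes the crash is a fatal signal (or Windows exception) that `faulthandler` is able to intercept, which has not been verified for this issue; `enable_crash_log` is a hypothetical helper name.

```python
import faulthandler


def enable_crash_log(path="crash_traceback.log"):
    """Ask Python to dump its traceback to `path` on a fatal signal."""
    log = open(path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log  # the file must stay open for as long as faulthandler is enabled
```

Calling `enable_crash_log()` once at the top of the script, before the OCR loop, is enough. The dump lists only Python frames, so at best it confirms which Python call was active when the native code failed.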

Maybe Related?
#1733
#1738

Labels

bug, resolved (fixed / implemented / answered)