Description
Describe the bug (mandatory)
Python crashes hard without any error or traceback output.
To Reproduce (mandatory)
import sys
import fitz


def get_tessocr(page, bbox):
    """Return OCR-ed span text using Tesseract.
    """
    mat = fitz.Matrix(5, 5)  # high resolution matrix
    # Step 1: Make a high-resolution image of the bbox.
    pix = page.get_pixmap(
        matrix=mat,
        clip=bbox,
    )
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    text = ocrpage.get_text()
    if text.endswith("\n"):
        text = text[:-1]
    return text


def gen_pdf(pdf_outpath='./output.pdf'):
    TextNorm_style = dict(fontname="hebo", fontsize=10)
    doc = fitz.open()
    page = doc.new_page(pno=-1,  # insertion point: end of document
                        width=595,  # page dimension: A4 portrait
                        height=842)
    bad = r'./temp/tmp8g7_xpcg\noequal\UTSW-100AT.msg'
    where_y = 100
    where_x = 75
    for fn, fp in enumerate(range(2000)):
        y = where_y + ((fn + 1) * 20)
        where = fitz.Point(where_x, y)
        page.insert_text(where, f"- {fp} - {bad}", **TextNorm_style)
    doc.save(pdf_outpath)
    doc.close()
    print(f"Done. Saved: {pdf_outpath}")


output_path = './data/Interesting/gen_output.pdf'
gen_pdf(pdf_outpath=output_path)
pdfDoc = fitz.open(output_path)
print(f"Number of Pages: {len(pdfDoc)}")
for pn, page in enumerate(pdfDoc):
    print(f"page {pn}")
    textpage = page.get_textpage(flags=3)
    page_text_words = page.get_text("words", textpage=textpage)
    for wn, wb in enumerate(page_text_words):
        if chr(65533) in wb[4]:
            x0, y0, x1, y1, words, block_no, line_no, word_no = wb
            print(words)
            # this will cause a hard crash
            new_words = get_tessocr(page, [x0, y0, x1, y1])
pdfDoc.close()
Expected behavior (optional)
Expected to work, or at least provide an error output.
Your configuration (mandatory)
3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)]
win32
PyMuPDF 1.19.3: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-12-12 06:51:56.
Built for Python 3.6 on win32 (64-bit).
Request for support
It would be great if I could simply skip whatever is going to cause this crash, i.e. miss extracting some text rather than crash. Failing that, an error message and traceback instead of a hard crash would be great.
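As a stopgap for the "skip rather than crash" part, one option might be to run the risky step in a child interpreter, so that a hard crash only kills the child. This is a minimal sketch using only the standard library; `run_isolated` is a hypothetical helper name (not part of PyMuPDF), and it assumes the OCR call can be wrapped in a small script that prints its result.

```python
import subprocess
import sys


def run_isolated(code, timeout=60):
    """Run a Python snippet in a child interpreter.

    Returns (ok, stdout, returncode). If the child crashes hard, this
    shows up as a non-zero return code instead of taking down the caller.
    A timeout is reported as (False, "", None).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # Python 3.6 compatible text mode
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "", None
    return proc.returncode == 0, proc.stdout, proc.returncode
```

With something like this, the caller could check `ok` and simply skip the word when the child did not exit cleanly. Starting an interpreter per word would be slow, so batching a whole page per child is probably more practical.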
Additional context (optional)
I was attempting to apply this example to my use case: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract2.py
While I tried to provide code to create a PDF that (seems to) fail in the exact same way, I have other PDFs that I cannot share that encounter the same failure. In these other PDFs, I do not actually write text, I am simply trying to extract the text. Another difference is that these other PDFs don't seem to print any word_block words, and the kernel just crashes straight out.
The issue seems to have something to do with the unknown Unicode and maybe text not being in page bounds?
I tried to call page.clean_contents() before processing the page, but it didn't help.
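To test the "not in page bounds" guess, the word bbox could be intersected with the page rectangle before it is passed as `clip`, and words that fall entirely outside the page could be skipped. This is only a sketch: `clip_to_page` is a hypothetical helper, and I have not confirmed that out-of-bounds clips are actually what triggers the crash.

```python
def clip_to_page(bbox, page_rect):
    """Intersect bbox with page_rect, both given as (x0, y0, x1, y1).

    Returns the intersection as a tuple, or None if the bbox lies
    completely outside the page (or the overlap has no area).
    """
    x0 = max(bbox[0], page_rect[0])
    y0 = max(bbox[1], page_rect[1])
    x1 = min(bbox[2], page_rect[2])
    y1 = min(bbox[3], page_rect[3])
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)


# Assumed usage inside the word loop of the reproduction script:
#     clip = clip_to_page((x0, y0, x1, y1), tuple(page.rect))
#     if clip is None:
#         continue  # word is off-page, skip OCR for it
#     new_words = get_tessocr(page, clip)
```

In the generated PDF most of the 2000 lines are written far below the 842-point page height, so a check like this would skip nearly all of them.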
I traced my actual use case and got this just before the hard crash. My use case module is called "PDFWorks.py"
PDFWorks.py(210): ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
--- modulename: fitz, funcname: pdfocr_tobytes
fitz.py(6886): EnsureOwnership(self)
--- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805): if getattr(o, "this", None) and not o.this.own():
fitz.py(6887): from io import BytesIO
--- modulename: _bootstrap, funcname: _handle_fromlist
<frozen importlib._bootstrap>(1007):
<frozen importlib._bootstrap>(1032):
fitz.py(6889): bio = BytesIO()
fitz.py(6890): self.pdfocr_save(bio, compress=compress, language=language)
--- modulename: fitz, funcname: pdfocr_save
fitz.py(6870): EnsureOwnership(self)
--- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805): if getattr(o, "this", None) and not o.this.own():
fitz.py(6872): return _fitz.Pixmap_pdfocr_save(self, filename, compress, language)
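The trace stops at the call into the native `_fitz.Pixmap_pdfocr_save` function, which suggests the crash happens in native code. To get at least some output at that point, the standard-library `faulthandler` module can dump the Python traceback when the interpreter hits a fatal error. A minimal sketch, where `enable_crash_log` and the log path are hypothetical names:

```python
import faulthandler


def enable_crash_log(log_path):
    """Ask Python to dump tracebacks to log_path on a fatal error.

    The file object is returned because it has to stay open for as long
    as faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_log("crash.log")` once at the top of the script (or running with `python -X faulthandler`) should leave a Python-level traceback behind if the process dies. It would not prevent the crash itself.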