pdfocr_save Pixmap_pdfocr_save Causes Hard Crash #1965

## Describe the bug (mandatory)
Python crashes hard without any error or traceback output.

## To Reproduce (mandatory)
```python
import sys
import fitz

def get_tessocr(page, bbox):
    """Return OCR-ed span text using Tesseract.
    """
    mat = fitz.Matrix(5, 5)  # high resolution matrix

    # Step 1: Make a high-resolution image of the bbox.
    pix = page.get_pixmap(
        matrix=mat,
        clip=bbox,
    )
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    text = ocrpage.get_text()
    if text.endswith("\n"):
        text = text[:-1]
    return text


def gen_pdf(pdf_outpath='./output.pdf'):
    TextNorm_style = dict(fontname="hebo", fontsize=10)

    doc = fitz.open()
    page = doc.new_page(pno = -1,# insertion point: end of document
                    width = 595, # page dimension: A4 portrait
                    height = 842)
        
    bad = r'./temp/tmp8g7_xpcg\noequal\UTSW-100AT.msg'
    where_y = 100
    where_x = 75
    for fn, fp in enumerate(range(2000)):
        y = where_y+( (fn+1)*20 )
        where = fitz.Point(where_x, y)
        page.insert_text(where, f"- {fp} - {bad}", **TextNorm_style)
    
    doc.save(pdf_outpath)
    doc.close()
    print(f"Done. Saved: {pdf_outpath}")

output_path = './data/Interesting/gen_output.pdf'
gen_pdf(pdf_outpath=output_path)

pdfDoc = fitz.open(output_path)
print(f"Number of Pages: {len(pdfDoc)}")

for pn, page in enumerate(pdfDoc):
    print(f"page {pn}")
    textpage = page.get_textpage(flags=3)
    page_text_words = page.get_text("words", textpage=textpage)
    
    for wn, wb in enumerate(page_text_words):
        if chr(65533) in wb[4]:
            x0, y0, x1, y1, words, block_no, line_no, word_no = wb
            print(words)
            
            #this will cause a hard crash
            new_words = get_tessocr(page, [x0, y0, x1, y1])
            
pdfDoc.close()
```

## Expected behavior (optional)
Expected to work, or at least provide an error output.
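
For anyone trying to get at least some output from a crash like this: a minimal sketch using the standard library's `faulthandler`, which dumps the Python-level stack when the interpreter dies from a fatal signal (or a Windows exception on Python 3.6+). The helper name `enable_crash_trace` and the log file name are placeholders, not part of PyMuPDF. It writes to a file rather than stderr because a crashing Jupyter kernel tends to lose stderr. It does not prevent the crash; it only records where Python was at the time.

```python
import faulthandler


def enable_crash_trace(log_path="crash_trace.log"):
    """Dump the Python stack to `log_path` if the process hits a fatal error.

    `log_path` is a hypothetical file name; any writable path works.
    """
    log_file = open(log_path, "w")
    # all_threads=True also dumps the stacks of threads other than the crashing one.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this object alive: faulthandler writes to its
    # file descriptor at crash time.
    return log_file


crash_log = enable_crash_trace()
# ... run the OCR loop after this point ...
```

If the process then dies inside `pdfocr_tobytes`, the log should contain the Python frames leading up to the native call.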

## Your configuration (mandatory)
```
3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] 
 win32 
 
PyMuPDF 1.19.3: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-12-12 06:51:56.
Built for Python 3.6 on win32 (64-bit).
```

## Request for support
It would be great if I could just skip content that is going to cause this crash, i.e. miss extracting the text rather than crash. Failing that, an error message and traceback rather than a hard crash would be great.
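
Since a native crash cannot be caught with `try`/`except`, one possible workaround is to run each OCR call in a child interpreter, so that only the child dies and the caller can skip that word. The following is a sketch under that assumption, not a fix for the underlying bug. `run_isolated` and `OCR_WORKER` are hypothetical names; the worker only uses the same PyMuPDF calls as the reproduction script above.

```python
import os
import subprocess
import sys

# Hypothetical worker: OCRs one clip of one page and writes the text to stdout.
# It mirrors get_tessocr() from the reproduction script.
OCR_WORKER = r"""
import sys
import fitz
path, pno = sys.argv[1], int(sys.argv[2])
clip = fitz.Rect(*[float(v) for v in sys.argv[3:7]])
page = fitz.open(path)[pno]
pix = page.get_pixmap(matrix=fitz.Matrix(5, 5), clip=clip)
ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
sys.stdout.write(ocrpdf[0].get_text())
"""


def run_isolated(script, args=(), timeout=60):
    """Run `script` in a child Python process.

    Returns the child's stdout as text, or None if the child crashed,
    exited with a non-zero status, or exceeded `timeout` seconds.
    """
    # Force UTF-8 in the child so the output decodes the same way on Windows.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script] + [str(a) for a in args],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            env=env,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", errors="replace")
```

In the loop above, the call would then look something like `run_isolated(OCR_WORKER, [output_path, pn, x0, y0, x1, y1])`, treating `None` as "skip this word". Starting an interpreter per word is slow, so batching several clips per child may be worth it.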


## Additional context (optional)
I attempted to apply this example to my use case: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract2.py

While I tried to provide code that creates a PDF which (seemingly) fails in exactly the same way, I have other PDFs that I cannot share which hit the same failure. For those PDFs I do not write any text; I am only trying to extract it. Another difference is that those PDFs don't seem to print any of the matching words before the kernel crashes outright.

The issue seems to have something to do with the Unicode replacement character (U+FFFD, `chr(65533)`) and maybe with text lying outside the page bounds?
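
To test the out-of-bounds half of that guess, the word bbox could be intersected with the page rectangle before calling `get_pixmap`, skipping words that fall entirely off the page. This is a plain-Python sketch with a hypothetical helper `clip_to_page`; whether off-page clips are actually what triggers the crash is unverified. In the reproduction script most of the 2000 lines are placed far below the 842 pt page height, so they would be skipped.

```python
def clip_to_page(bbox, page_rect):
    """Intersect `bbox` with `page_rect`; both are (x0, y0, x1, y1) tuples.

    Returns the clipped rectangle, or None if nothing of `bbox`
    lies inside the page (empty or inverted intersection).
    """
    x0 = max(bbox[0], page_rect[0])
    y0 = max(bbox[1], page_rect[1])
    x1 = min(bbox[2], page_rect[2])
    y1 = min(bbox[3], page_rect[3])
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)


# A4 portrait page as created in gen_pdf()
a4_page = (0, 0, 595, 842)
print(clip_to_page((75, 900, 200, 910), a4_page))   # word below the page
print(clip_to_page((500, 800, 700, 900), a4_page))  # word partly on the page
```

In the loop, `tuple(page.rect)` would supply `page_rect`, and a `None` result would mean skipping the call to `get_tessocr`.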

I tried to call `page.clean_contents()` before processing the page, but it didn't help.

I traced my actual use case and got this just before the hard crash. My use case module is called "PDFWorks.py"
```
PDFWorks.py(210):         ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
 --- modulename: fitz, funcname: pdfocr_tobytes
fitz.py(6886):         EnsureOwnership(self)
 --- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805):     if getattr(o, "this", None) and not o.this.own():
fitz.py(6887):         from io import BytesIO
 --- modulename: _bootstrap, funcname: _handle_fromlist
<frozen importlib._bootstrap>(1007):
<frozen importlib._bootstrap>(1032):
fitz.py(6889):         bio = BytesIO()
fitz.py(6890):         self.pdfocr_save(bio, compress=compress, language=language)
 --- modulename: fitz, funcname: pdfocr_save
fitz.py(6870):         EnsureOwnership(self)
 --- modulename: fitz, funcname: EnsureOwnership
fitz.py(2805):     if getattr(o, "this", None) and not o.this.own():
fitz.py(6872):         return _fitz.Pixmap_pdfocr_save(self, filename, compress, language)
```

Maybe related?
- [#1733](https://github.com/pymupdf/PyMuPDF/issues/1733)
- [#1738](https://github.com/pymupdf/PyMuPDF/issues/1738)
