Invalid characters in versions >= 1.22

## Describe the bug (mandatory)
In versions >= 1.22, `get_text()` produces � (U+FFFD) characters in some cases, usually related to LaTeX. This was not an issue in `v1.21.1`, and other PDF libraries extract the text just fine (though `pdfplumber` appears to miss a few characters).

Additionally, `get_text(sort=True)` converts the � to `\udc52`, which creates further problems. For example, `print()` fails with `UnicodeEncodeError: 'utf-8' codec can't encode character '\udc52' in position 35: surrogates not allowed`.
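The `print()` failure can be reproduced without PyMuPDF, since `\udc52` is a lone low surrogate and strict UTF-8 refuses to encode it. The following is a minimal sketch using only the standard library; `escape_surrogates` is a hypothetical helper name for illustration, not a PyMuPDF function, and the sample string is made up. The last lines record an observation, not a diagnosis: `DC52` is also the low half of the UTF-16 pair for U+1D452, the italic "e" that is missing from the output.

```python
# A lone surrogate is a legal element of a Python str,
# but strict UTF-8 encoding rejects it.
text = "Re = \udc52"

try:
    text.encode("utf-8")
    strict_encoding_ok = True
except UnicodeEncodeError:
    strict_encoding_ok = False


def escape_surrogates(s: str) -> str:
    """Hypothetical helper: turn unencodable code points into backslash escapes."""
    return s.encode("utf-8", errors="backslashreplace").decode("utf-8")


printable = escape_surrogates(text)
print(strict_encoding_ok)
print(printable)

# Observation only: U+1D452 (MATHEMATICAL ITALIC SMALL E) is the
# UTF-16 surrogate pair D835 DC52.
utf16_units = "\U0001D452".encode("utf-16-be").hex()
print(utf16_units)
```

Escaping like this only makes the string printable for debugging; it does not recover the original character.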

[1001.2481.pdf](https://github.com/pymupdf/PyMuPDF/files/12113342/1001.2481.pdf)

## To Reproduce (mandatory)
```
import fitz  # PyMuPDF
import pdftotext
import pdfplumber


def print_comparison(fn, page):
    """Print the text of one page (0-based index) as extracted by each library."""
    # PyMuPDF
    with fitz.open(fn) as pymupdf_doc:
        pymupdf_text = pymupdf_doc[page].get_text()

    # pdftotext
    with open(fn, "rb") as f:
        pdftotext_text = pdftotext.PDF(f)[page]

    # pdfplumber
    with pdfplumber.open(fn) as pdfplumber_doc:
        pdfplumber_text = pdfplumber_doc.pages[page].extract_text()

    # repr() makes replacement characters and line breaks visible
    print("PyMuPDF:\n")
    print(repr(pymupdf_text))
    print("\npdftotext:\n")
    print(repr(pdftotext_text))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_text))


print_comparison('1001.2481.pdf', 10)
```


```
PyMuPDF:

' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝  𝑅𝑐��� − 𝑅𝑐���𝐶 1/2 which yields the following critical numbers: 𝑅𝑐���𝐶\n𝑝𝑖𝑝𝑐��� = 2550, \n 𝑅𝑐���𝐶\n𝑐ℎ𝑎𝑛𝑛𝑐���𝑙 = 1480  and  𝑅𝑐���𝐶\n𝑐���𝑡���𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases.  \n \n \n \n'

pdftotext:

'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶𝑝𝑖𝑝𝑒 = 2550,\n𝑅𝑒𝐶𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝐶𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\nsaturation sets in much earlier than in the other two cases.\n\n\x0c'

pdfplumber:

'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒−𝑅𝑒 1/2 which yields the following critical numbers: 𝑅𝑒𝑝𝑖𝑝𝑒 = 2550,\n𝐶 𝐶\n𝑅𝑒𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\n𝐶 𝐶\nsaturation sets in much earlier than in the other two cases.'
```

For comparison, the PyMuPDF output with v1.21.1:
```
' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝  𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶\n𝑝𝑖𝑝𝑒 = 2550, \n 𝑅𝑒𝐶\n𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480  and  𝑅𝑒𝐶\n𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases.  \n \n \n \n'
```


## Expected behavior (optional)
I expect the text to be extracted as it was in v1.21.1. Even if invalid characters are present, I would expect `sort=True` to leave them unchanged.
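One way to check this expectation across versions is to count the suspect code points in the extracted text. The sketch below assumes nothing about PyMuPDF itself: `count_suspect_chars` is a hypothetical helper, and the sample strings are short stand-ins modelled on the outputs above rather than real extraction results.

```python
def count_suspect_chars(text: str) -> dict:
    """Count U+FFFD replacement characters and lone surrogates in a string."""
    replacement = text.count("\ufffd")
    # In a Python 3 str, any code point in D800..DFFF is an unpaired surrogate.
    surrogates = sum(1 for ch in text if 0xD800 <= ord(ch) <= 0xDFFF)
    return {"replacement": replacement, "surrogates": surrogates}


# Stand-in samples (assumptions for illustration):
good = "\U0001D445\U0001D452 = 2550"                  # italic "Re", as in v1.21.1
bad = "\U0001D445\U0001D450\ufffd\ufffd\ufffd = 2550"  # as in >= 1.22
bad_sorted = "\U0001D445\udc52 = 2550"                 # as reported for sort=True

for sample in (good, bad, bad_sorted):
    print(count_suspect_chars(sample))
```

In a real check, the string passed in would be the result of `page.get_text()` or `page.get_text(sort=True)`, and both counts would be expected to be zero for this document.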



## Your configuration (mandatory)
 - Ubuntu 18.04.6 LTS
 - PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.10 on linux (64-bit).


