Closed
Describe the bug (mandatory)
get_text() in versions >=1.22 produces � characters in some cases, usually related to LaTeX. This was not an issue in v1.21.1, and other PDF libraries extract the text just fine (though pdfplumber appears to miss a few characters).
Additionally, get_text(sort=True) converts the � to \udc52, which creates other issues, e.g. it causes print() to fail with the error UnicodeEncodeError: 'utf-8' codec can't encode character '\udc52' in position 35: surrogates not allowed
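For anyone hitting the same print() failure, here is a minimal sketch of a workaround that only makes the string printable again; it does not fix the extraction itself. The helper name replace_surrogates and the sample string are hypothetical: the sample stands in for what get_text(sort=True) returns on the affected page.

```python
def replace_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Return text with every lone surrogate (U+D800..U+DFFF) replaced."""
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


# Hypothetical stand-in for the sorted output; the real string would
# come from page.get_text(sort=True).
sample = "R\udc52 = 2550"

try:
    sample.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    # Lone surrogates cannot be encoded as UTF-8, which is why print() fails.
    strict_ok = False

safe = replace_surrogates(sample)
encoded = safe.encode("utf-8")  # succeeds once the surrogates are gone
print(safe)
```

Replacing with U+FFFD keeps the position of the lost character visible, which seems preferable to dropping it silently while the underlying issue is open.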
To Reproduce (mandatory)
import fitz  # PyMuPDF
import pdftotext
import pdfplumber


def print_comparison(fn, page):
    # PyMuPDF
    pymupdf_doc = fitz.open(fn)
    # pdftotext
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)
    # pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))


print_comparison('1001.2481.pdf', 10)
PyMuPDF:
' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝ 𝑅𝑐��� − 𝑅𝑐���𝐶 1/2 which yields the following critical numbers: 𝑅𝑐���𝐶\n𝑝𝑖𝑝𝑐��� = 2550, \n 𝑅𝑐���𝐶\n𝑐ℎ𝑎𝑛𝑛𝑐���𝑙 = 1480 and 𝑅𝑐���𝐶\n𝑐���𝑡���𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases. \n \n \n \n'
pdftotext:
'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶𝑝𝑖𝑝𝑒 = 2550,\n𝑅𝑒𝐶𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝐶𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\nsaturation sets in much earlier than in the other two cases.\n\n\x0c'
pdfplumber:
'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒−𝑅𝑒 1/2 which yields the following critical numbers: 𝑅𝑒𝑝𝑖𝑝𝑒 = 2550,\n𝐶 𝐶\n𝑅𝑒𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\n𝐶 𝐶\nsaturation sets in much earlier than in the other two cases.'
PyMuPDF v1.21.1:
' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝ 𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶\n𝑝𝑖𝑝𝑒 = 2550, \n 𝑅𝑒𝐶\n𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝐶\n𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases. \n \n \n \n'
Expected behavior (optional)
I expect the text to be extracted as it was in v1.21.1. If there are invalid characters, I'd also expect get_text(sort=True) to leave them unchanged rather than converting them to other characters such as \udc52.
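One way to check the second expectation is to compare which characters the two calls return, ignoring whitespace since sorting legitimately changes the layout. This is a rough sketch: char_diff is a hypothetical helper, and the two sample strings stand in for the results of page.get_text() and page.get_text(sort=True).

```python
from collections import Counter


def char_diff(plain: str, sorted_text: str) -> dict:
    """Compare non-whitespace character counts of two extractions.

    Returns {char: (count_in_plain, count_in_sorted)} for every
    character whose count differs between the two strings.
    """
    a = Counter(ch for ch in plain if not ch.isspace())
    b = Counter(ch for ch in sorted_text if not ch.isspace())
    return {ch: (a[ch], b[ch]) for ch in a.keys() | b.keys() if a[ch] != b[ch]}


# Hypothetical stand-ins for get_text() and get_text(sort=True).
plain = "R\ufffd = 2550"
sorted_text = "R\udc52 = 2550"

diff = char_diff(plain, sorted_text)
# repr() avoids the UnicodeEncodeError that printing a surrogate would raise.
print(repr(diff))
```

An empty result would mean sorting only reordered the text; any entry points at a character that was changed along the way.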
Your configuration (mandatory)
- Ubuntu 18.04.6 LTS
- PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.10 on linux (64-bit).