Invalid characters in versions >= 1.22 #2553

Closed
@brandenkmurray

Describe the bug (mandatory)

In versions >= 1.22, get_text() produces � characters in some cases, usually in text related to LaTeX. This was not an issue in v1.21.1, and other PDF libraries extract the text correctly (though pdfplumber appears to miss a few characters).

Additionally, get_text(sort=True) converts the � to \udc52, which causes further problems. For example, print() fails with: UnicodeEncodeError: 'utf-8' codec can't encode character '\udc52' in position 35: surrogates not allowed
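
The print() failure can be reproduced without PyMuPDF, because any lone surrogate in a Python string fails strict UTF-8 encoding. Below is a minimal sketch of a possible stopgap, assuming it is acceptable to map surrogates back to U+FFFD before printing. `replace_lone_surrogates` is a hypothetical helper written for this report, not part of PyMuPDF, and `sample` is a stand-in string rather than real extractor output.

```python
def replace_lone_surrogates(text: str) -> str:
    """Return text with every surrogate code point (U+D800..U+DFFF) replaced by U+FFFD."""
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


# Stand-in for what get_text(sort=True) returns (assumption: not real output).
sample = "R\udc52C"

try:
    sample.encode("utf-8")  # strict encoding, as used when printing to a UTF-8 stream
except UnicodeEncodeError as exc:
    print("strict encode fails:", exc.reason)

cleaned = replace_lone_surrogates(sample)
print(cleaned.encode("utf-8"))  # encodes without raising
```

This only hides the symptom so that downstream code does not crash; the characters that were lost during extraction are not recovered.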

1001.2481.pdf

To Reproduce (mandatory)

import fitz
import pdftotext
import pdfplumber

def print_comparison(fn, page):
    #pymupdf
    pymupdf_doc = fitz.open(fn)

    #pdftotext
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)

    #pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))


print_comparison('1001.2481.pdf', 10)
PyMuPDF:

' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝  𝑅𝑐��� − 𝑅𝑐���𝐶 1/2 which yields the following critical numbers: 𝑅𝑐���𝐶\n𝑝𝑖𝑝𝑐��� = 2550, \n 𝑅𝑐���𝐶\n𝑐ℎ𝑎𝑛𝑛𝑐���𝑙 = 1480  and  𝑅𝑐���𝐶\n𝑐���𝑡���𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases.  \n \n \n \n'

pdftotext:

'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶𝑝𝑖𝑝𝑒 = 2550,\n𝑅𝑒𝐶𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝐶𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\nsaturation sets in much earlier than in the other two cases.\n\n\x0c'

pdfplumber:

'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒−𝑅𝑒 1/2 which yields the following critical numbers: 𝑅𝑒𝑝𝑖𝑝𝑒 = 2550,\n𝐶 𝐶\n𝑅𝑒𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\n𝐶 𝐶\nsaturation sets in much earlier than in the other two cases.'

PyMuPDF v1.21.1

' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝  𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶\n𝑝𝑖𝑝𝑒 = 2550, \n 𝑅𝑒𝐶\n𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480  and  𝑅𝑒𝐶\n𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases.  \n \n \n \n'

Expected behavior (optional)

I expect the text to be extracted as it was in v1.21.1. If the output does contain invalid characters, I'd also expect get_text(sort=True) to leave them unchanged rather than converting them to surrogates.
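
To compare versions on this expectation, the extracted text can be scanned for the two symptoms described above. This is a rough sketch only: `find_invalid_chars` is a hypothetical helper, and treating every U+FFFD as an extraction error is an assumption (a PDF could legitimately contain that character).

```python
def find_invalid_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for U+FFFD and surrogate code points in text."""
    hits = []
    for i, ch in enumerate(text):
        if ch == "\ufffd" or 0xD800 <= ord(ch) <= 0xDFFF:
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits


# Stand-in strings (assumption: shortened, not the real page text).
print(find_invalid_chars("R\ufffdC"))  # plain extraction symptom
print(find_invalid_chars("R\udc52C"))  # sort=True symptom

# With PyMuPDF installed, the same check could be applied to the page from this report:
#   import fitz
#   page = fitz.open("1001.2481.pdf")[10]
#   print(find_invalid_chars(page.get_text()))
#   print(find_invalid_chars(page.get_text(sort=True)))
```

An empty list for both calls on page 10 would match the v1.21.1 behavior shown above.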

Your configuration (mandatory)

  • Ubuntu 18.04.6 LTS
  • PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
    Version date: 2023-06-21 00:00:01.
    Built for Python 3.10 on linux (64-bit).
