Invalid characters in versions >= 1.22 #2553

Closed
brandenkmurray opened this issue Jul 20, 2023 · 7 comments
Labels
bug upstream bug bug outside this package

Comments

@brandenkmurray

Describe the bug (mandatory)

get_text() in versions >= 1.22 produces � (U+FFFD) characters in some cases, usually related to LaTeX. This was not an issue in v1.21.1, and other PDF libraries extract the text just fine (though pdfplumber appears to miss a few characters).

Additionally, get_text(sort=True) converts the � to \udc52, which creates other issues. For example, print() fails with the error UnicodeEncodeError: 'utf-8' codec can't encode character '\udc52' in position 35: surrogates not allowed
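A possible stopgap for affected versions, sketched here rather than taken from the report: replace any lone surrogate with U+FFFD before printing or encoding. The helper name `scrub_surrogates` and the sample string are hypothetical; the sample only stands in for what get_text(sort=True) returned.

```python
import re

# Matches any lone UTF-16 surrogate code point (U+D800..U+DFFF).
_SURROGATES = re.compile("[\ud800-\udfff]")


def scrub_surrogates(text: str) -> str:
    """Replace lone surrogates with U+FFFD so the string can be UTF-8 encoded."""
    return _SURROGATES.sub("\ufffd", text)


# Hypothetical stand-in for text returned by get_text(sort=True).
raw = "Re \udc52 = 2550"

try:
    raw.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    # Same failure mode as the print() error quoted above.
    encodable = False

clean = scrub_surrogates(raw)
clean.encode("utf-8")  # no longer raises
```

This hides the symptom only; the extracted character is still wrong, so it does not substitute for the upstream fix.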

1001.2481.pdf

To Reproduce (mandatory)

import fitz
import pdftotext
import pdfplumber

def print_comparison(fn, page):
    #pymupdf
    pymupdf_doc = fitz.open(fn)

    #pdftotext
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)

    #pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))


print_comparison('1001.2481.pdf', 10)
PyMuPDF:

' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝  𝑅𝑐��� − 𝑅𝑐���𝐶 1/2 which yields the following critical numbers: 𝑅𝑐���𝐶\n𝑝𝑖𝑝𝑐��� = 2550, \n 𝑅𝑐���𝐶\n𝑐ℎ𝑎𝑛𝑛𝑐���𝑙 = 1480  and  𝑅𝑐���𝐶\n𝑐���𝑡���𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases.  \n \n \n \n'

pdftotext:

'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶𝑝𝑖𝑝𝑒 = 2550,\n𝑅𝑒𝐶𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝐶𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\nsaturation sets in much earlier than in the other two cases.\n\n\x0c'

pdfplumber:

'Fig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the\ndimensionless growth rate of turbulent spots plotted as a function of Re close to the\nphase transition in pipe (A), channel (B) and square duct (C). The line follows\n𝐺 ∝ 𝑅𝑒−𝑅𝑒 1/2 which yields the following critical numbers: 𝑅𝑒𝑝𝑖𝑝𝑒 = 2550,\n𝐶 𝐶\n𝑅𝑒𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480 and 𝑅𝑒𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear\n𝐶 𝐶\nsaturation sets in much earlier than in the other two cases.'

PyMuPDF v1.21.1

' \n \nFig. 3. Growth rate of a turbulent spot close to the phase transition. Square of the \ndimensionless growth rate of turbulent spots plotted as a function of Re close to the \nphase transition in pipe (A), channel (B) and square duct (C). The line follows \n𝐺 ∝  𝑅𝑒 − 𝑅𝑒𝐶 1/2 which yields the following critical numbers: 𝑅𝑒𝐶\n𝑝𝑖𝑝𝑒 = 2550, \n 𝑅𝑒𝐶\n𝑐ℎ𝑎𝑛𝑛𝑒𝑙 = 1480  and  𝑅𝑒𝐶\n𝑑𝑢𝑐𝑡 = 2250 . Note that for the channel non-linear \nsaturation sets in much earlier than in the other two cases.  \n \n \n \n'

Expected behavior (optional)

I expect the text to be extracted like it was in v1.21.1. If there are invalid characters, I'd also expect the sort to keep the characters the same.

Your configuration (mandatory)

  • Ubuntu 18.04.6 LTS
  • PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
    Version date: 2023-06-21 00:00:01.
    Built for Python 3.10 on linux (64-bit).
@JorjMcKie
Collaborator

Issue also confirmed for Windows. Will be fixed in the coming version.

Background:
Plain text output currently does not use Python's escape string conversion, unlike the other .get_text("variant") text extraction variants.
Using .get_text("text", sort=True) internally invokes .get_text("blocks", sort=True) and returns its text. This explains the different handling of certain Unicode characters.

This fix removes that difference, so that all output from variants "text", "words", "blocks", "dict"/"json" and "rawdict"/"rawjson" uses Python's PyUnicode_DecodeRawUnicodeEscape() C function.
Output variants "html", "xhtml" and "xml" continue to return unmodified MuPDF-generated output and are therefore not affected.
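For readers unfamiliar with that C function: its Python-level counterpart is the "raw_unicode_escape" codec. The sketch below does not reproduce PyMuPDF's internals; it only shows how the codec behaves, including the fact that it does not validate surrogates, which would be consistent with the \udc52 seen in the sorted output. The byte strings are made-up examples.

```python
# "\uXXXX" escapes in the byte string are decoded to code points;
# all other bytes are taken as Latin-1.
escaped = b"R\\u00e9 = 2550"
decoded = escaped.decode("raw_unicode_escape")  # 'Ré = 2550'

# The decoder accepts a lone surrogate escape without complaint,
# producing a str that cannot later be encoded as UTF-8.
lone = b"\\udc52".decode("raw_unicode_escape")

try:
    lone.encode("utf-8")
    lone_is_encodable = True
except UnicodeEncodeError:
    lone_is_encodable = False
```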

JorjMcKie added a commit that referenced this issue Sep 11, 2023
For text extraction `get_text("words")`, or extractWORDS, words are defined as strings not containing white space.
This change allows adding up to 64 characters that also function as delimiters.
This makes it possible, for instance, to separate words from punctuation or to decompose an e-mail address into its components.
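To illustrate the idea in plain Python (this is not PyMuPDF's implementation, and `split_words` is a hypothetical helper), splitting on white space plus a set of extra delimiter characters might look like this:

```python
import re


def split_words(text: str, delimiters: str = "") -> list[str]:
    """Split on white space plus any extra delimiter characters."""
    if len(delimiters) > 64:
        # Mirrors the 64-character limit mentioned in the commit message.
        raise ValueError("at most 64 extra delimiter characters are allowed")
    # re.escape keeps characters such as '-', ']' and '^' literal inside the class.
    pattern = r"[\s" + re.escape(delimiters) + r"]+"
    return [word for word in re.split(pattern, text) if word]


plain = split_words("mail john.doe@example.com now")
decomposed = split_words("mail john.doe@example.com now", ".@")
```

With no extra delimiters the e-mail address stays one word; with "." and "@" added it is broken into its components.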

Other changes:

Fixing #2522: correcting the typo

Remove some unnecessary setting of flags when creating annotations.

Fixing #2553:
Adjust plain text extraction to use the same approach as other variants. This entails using Unicode escape strings on output instead of using the output of fz_chartorune.
Another consequence is that standard text output is directed to a fz_buffer instead of to a fz_output.

Fixing #2556: Add checking the existence of path dictionaries at every possible place.
Includes an additional test function.

Add functions JM_ignore_rect / JM_ignore_irect which return a bool. The functions return True if the rectangle should be ignored.
This is the case for infinite and empty rectangles, but also for any rectangle that has a common edge with the infinite rectangle.
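A rough Python sketch of the rule just described, under stated assumptions: the bounds below are what MuPDF's infinite rectangle is assumed to use (FZ_MIN_INF_RECT / FZ_MAX_INF_RECT), "empty" is taken to mean a non-positive width or height, and `ignore_rect` is a hypothetical name, not the actual JM_ignore_rect code.

```python
# Assumed coordinates of MuPDF's "infinite" rectangle.
INF_MIN = -0x80000000
INF_MAX = 0x7FFFFF80


def ignore_rect(x0: float, y0: float, x1: float, y1: float) -> bool:
    """Return True if the rectangle should be ignored."""
    # Empty: no positive width or height.
    if x0 >= x1 or y0 >= y1:
        return True
    # Infinite, or sharing at least one edge with the infinite rectangle.
    return x0 == INF_MIN or y0 == INF_MIN or x1 == INF_MAX or y1 == INF_MAX
```

The infinite rectangle itself is covered by the second test, since all four of its edges coincide with the bounds.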

Support variable setting of character border widths for insert_text() / insert_textbox(). This is a factor to be multiplied
with the font size. Default is 0.05 (read: 5% of the fontsize). This value is relevant for text rendering modes 1 and 2 only.

Fixing #2637:
In Page.insert_textbox, when the last word of a line won't fit in the line buffer, we did not increase the line position. This is now handled correctly.
JorjMcKie added a commit that referenced this issue Oct 19, 2023
Previously, plain text output converted characters via fz_chartorune, whereas "words", "blocks", "dict" and "rawdict" handled character conversion differently, using Python raw unicode decoding.
Yet another behavior was used in page.get_textbox(), which is plain text extraction from within a rectangle, independent of using a clip.

This fix ensures that all plain text extraction (including textbox) delivers the same output.
This is checked by comparing the set of characters produced in each of the cases.
@brandenkmurray
Author

Thanks for getting a fix in. Any estimate on when you'll cut the next release?

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.6.

@vicent4no

Hello everyone. It seems like this issue keeps happening specifically with this PDF

Universal_Registration_Document_2023_VUK.pdf

Using version 1.21.1, text is extracted correctly with get_text('text') for each page. That is, there are no unknown characters (represented by �).

However, using version 1.23.6 produces them. Could you confirm this, please?

Sample code attached below.

import fitz

def extract_text_from_pdf(doc: fitz.Document):
    result = []

    for page in doc:
        result.append(page.get_text('text'))

    return result

doc: fitz.Document = fitz.open("path_to_pdf_here")

text_list: list[str] = extract_text_from_pdf(doc)
print(text_list)
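To make such a report easier to confirm, it may help to say which pages are affected. The helper below is a sketch (the name `pages_with_replacement_chars` is hypothetical); it works on a plain list of page texts such as `text_list` above, and the sample list is made up.

```python
def pages_with_replacement_chars(page_texts: list[str]) -> dict[int, int]:
    """Map 0-based page index to the number of U+FFFD characters on that page."""
    return {
        index: text.count("\ufffd")
        for index, text in enumerate(page_texts)
        if "\ufffd" in text
    }


# Made-up sample standing in for the extracted page texts.
sample_pages = ["clean page", "two\ufffdbad\ufffdchars", "", "one\ufffd"]
affected = pages_with_replacement_chars(sample_pages)
```

Running the same check under 1.21.1 and 1.23.6 would show whether the counts differ between versions.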

@JorjMcKie
Collaborator

@vicent4no there has been a bug report for the underlying library MuPDF for exactly this problem.
Should be fixed in next version.

Background:
That PDF uses a subset of the Calibri font in which the glyph used for "space" points, for unknown reasons, to the ASCII control character 0x09 (instead of to space, 0x20), for which no glyph exists in the font. Hence the result 0xFFFD (the Unicode replacement character).
The next version will include a fix that automatically replaces these characters with a space.
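Until that version is available, a blunt post-processing step could approximate the behavior for this particular PDF. This is only a sketch with a hypothetical helper name; it also blanks out characters that are genuinely unknown, so it is not equivalent to the library fix.

```python
def replace_invalid_with_space(text: str) -> str:
    """Replace every U+FFFD in extracted text with an ordinary space."""
    return text.replace("\ufffd", " ")


# Made-up sample resembling text where the font's "space" glyph was misreported.
fixed = replace_invalid_with_space("Universal\ufffdRegistration\ufffdDocument")
```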

@vicent4no

Thank you for your time and thorough explanation, I appreciate it.

JorjMcKie added a commit that referenced this issue Nov 23, 2023
JorjMcKie added a commit that referenced this issue Nov 23, 2023
@julian-smith-artifex-com
Collaborator

Fixed in 1.23.7.
