persistent get_text() formatting #2730

Closed
vorel99 opened this issue Oct 10, 2023 · 3 comments

vorel99 commented Oct 10, 2023

Is your feature request related to a problem? Please describe.
I have a problem loading a document: the text of the page is extracted in the wrong order.

import fitz
doc = fitz.open('document.pdf')
doc.get_page_text(4)

So I added sort=True to resolve this:

import fitz
doc = fitz.open('document.pdf')
doc.get_page_text(4, sort=True)

That resolved the ordering problem, but a new problem appeared: some characters in the text were replaced with <?>.
I found some information about this behavior on this page: https://pymupdf.readthedocs.io/en/latest/recipes-common-issues-and-their-solutions.html#problem-unreadable-text
But I don't think this is the desired behaviour.

Describe the solution you'd like
I don't know how the package is implemented, but would it be possible to use the same text formatters from get_text("text") in get_text("blocks")?
That would resolve the inconsistent formatting when the sort argument is changed.

Additional context
I'm sorry, I cannot send you the PDF file. It is an internal company file.

vorel99 changed the title from "persist text format" to "persistent get_text() formatting" on Oct 10, 2023

vorel99 commented Oct 11, 2023

Here is a sample PDF. When .get_page_text(0, sort=False) is used, the header and the paragraph are in the wrong order.
When .get_page_text(0, sort=True) is used, the header contains <?> symbol.
sample.pdf

JorjMcKie (Collaborator) commented

Thanks for submitting this!
This is not an "enhancement" but a bug report 👍.
get_text(sort=True) internally uses get_text("blocks") so that the text can be sorted by coordinates.
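
As a rough illustration of sorting by coordinates, the sketch below orders block tuples top-to-bottom, then left-to-right. The tuple layout `(x0, y0, x1, y1, text, block_no, block_type)` follows what `get_text("blocks")` is documented to return. The sort key (bottom edge `y1`, then left edge `x0`) and the helper name `sorted_block_text` are assumptions for illustration, not a copy of the library's internals, and the sample blocks are made up.

```python
def sorted_block_text(blocks):
    """Join block texts in reading order: top-to-bottom, then left-to-right.

    Each block is assumed to be a tuple
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    # Assumed sort key: bottom edge (y1) first, then left edge (x0).
    ordered = sorted(blocks, key=lambda b: (b[3], b[0]))
    return "".join(b[4] for b in ordered)


# Made-up blocks: the paragraph is stored before the header in the file.
blocks = [
    (50.0, 100.0, 500.0, 140.0, "Paragraph text\n", 0, 0),
    (50.0, 40.0, 300.0, 60.0, "Header\n", 1, 0),
]
print(sorted_block_text(blocks))
```

Because the sorted path goes through the "blocks" output, any difference in how block text is encoded shows up only when sort=True is passed.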
Except for plain (naive) text extraction, the other variants use Python's raw unicode escape encoding. This can, as in your test case, lead to wrong results if the text happens to contain the character sequences "\u" or "\U".
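
The effect can be reproduced with Python's standard `raw_unicode_escape` codec alone. The snippet below is only a sketch of the codec's behavior, not of PyMuPDF's internal code path. The sample strings are invented, and whether the `<?>` seen in the report comes from exactly this replacement step is an assumption.

```python
# Text as it might appear on a page: a literal backslash followed by "u0041".
original = "path\\u0041 end"

# Encoding with raw_unicode_escape leaves the literal backslash untouched ...
encoded = original.encode("raw_unicode_escape")
# ... so decoding misreads the literal "\u0041" as an escape for "A".
decoded = encoded.decode("raw_unicode_escape")
print(repr(original), "->", repr(decoded))

# A "\u" that is not followed by four hex digits cannot be decoded at all.
# With errors="replace" the codec substitutes U+FFFD (the replacement character).
broken = "C:\\users".encode("raw_unicode_escape")
lossy = broken.decode("raw_unicode_escape", errors="replace")
print(repr(lossy))
```

In both cases the round trip changes the text, although the input contained nothing but ordinary characters.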

JorjMcKie added the bug label and removed the enhancement label on Oct 11, 2023
JorjMcKie added a commit that referenced this issue on Oct 11, 2023
Text output uses raw backslash decoding. This will lead to wrong output for accidental character combinations "\u", "\U".
This fix prevents this by outputting the backslash itself as backslash-encoded.
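
The idea of that fix can be sketched in pure Python: write every literal backslash as its own escape sequence before encoding, so the decoder can no longer mistake it for the start of `\u` or `\U`. The helper names `encode_text` and `decode_text` are hypothetical, and the actual commit changes the library's own output code, which is not reproduced here.

```python
def encode_text(text):
    """Encode text so that a raw_unicode_escape round trip is lossless."""
    # Write each literal backslash as its own escape, "\u005c", before encoding.
    return text.replace("\\", "\\u005c").encode("raw_unicode_escape")


def decode_text(data):
    """Decode bytes produced by encode_text()."""
    return data.decode("raw_unicode_escape")


samples = ["path\\u0041 end", "C:\\users\\name", "plain text", "snowman \u2603"]
for sample in samples:
    # Every sample survives the round trip unchanged.
    assert decode_text(encode_text(sample)) == sample
```

Escaping only the backslash keeps the output otherwise identical to before, so text without backslashes is unaffected.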
JorjMcKie reopened this on Oct 11, 2023
julian-smith-artifex-com (Collaborator) commented

Fixed in 1.23.5.
