persistent get_text() formatting #2730

vorel99 · 2023-10-10T13:11:26Z

Is your feature request related to a problem? Please describe.
I have a problem loading a document: the text of a page is extracted in the wrong order.

import fitz
doc = fitz.open('document.pdf')
doc.get_page_text(4)

So I added sort=True to resolve this:

import fitz
doc = fitz.open('document.pdf')
doc.get_page_text(4, sort=True)

That resolved the ordering problem, but a new problem appeared: some characters in the text were replaced with <?>.
I found some information about this behavior on this page: https://pymupdf.readthedocs.io/en/latest/recipes-common-issues-and-their-solutions.html#problem-unreadable-text
But I don't think this is the desired behaviour.

Describe the solution you'd like
I don't know how the package is implemented, but would it be possible to use the same text formatters in get_text("blocks") as in get_text("text")?
That would resolve the inconsistent formatting when the sort argument is changed.

Additional context
I'm sorry, I cannot send you the PDF file. It is an internal company file.

vorel99 · 2023-10-11T10:43:11Z

Here is a sample PDF. When .get_page_text(0, sort=False) is used, the header and the paragraph are in the wrong order.
When .get_page_text(0, sort=True) is used, the header contains the <?> symbol.
sample.pdf

JorjMcKie · 2023-10-11T12:00:46Z

Thanks for submitting this!
This is not an "enhancement" but a bug report 👍.
get_text(sort=True) internally uses get_text("blocks") so that the text can be sorted by coordinates.
Except for plain (naive) text extraction, the other variants use Python's raw unicode escape encoding for their output. This can, as in your test case, lead to wrong results if the character strings "\u" or "\U" happen to occur in the text.
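
To illustrate the coordinate sorting mentioned above, here is a minimal sketch in plain Python. It does not call PyMuPDF. It assumes blocks are tuples shaped like the items of get_text("blocks"), that is (x0, y0, x1, y1, text, block_no, block_type), and that sorting uses the bottom edge first and the left edge second. The sort key, the helper name sorted_text and the sample data are assumptions for illustration, not a statement of the library's actual code.

```python
from typing import List, Tuple

# Assumed block layout: (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def sorted_text(blocks: List[Block]) -> str:
    """Join block texts after ordering them top-to-bottom, then left-to-right.

    Hypothetical helper: the key (y1, x0) is an assumption about how a
    coordinate sort could be done, not PyMuPDF's confirmed implementation.
    """
    ordered = sorted(blocks, key=lambda b: (b[3], b[0]))
    return "".join(b[4] for b in ordered)


# Made-up blocks in "file order": the paragraph is stored before the header,
# although the header sits higher on the page (smaller y values).
blocks: List[Block] = [
    (72.0, 120.0, 500.0, 160.0, "Paragraph text.\n", 0, 0),
    (72.0, 72.0, 300.0, 96.0, "Header\n", 1, 0),
]

print(sorted_text(blocks))
```

Because the sorted output is assembled from block texts rather than from the plain text extractor, any encoding quirk of the block output also shows up in the sorted result.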

Text output uses raw backslash decoding. This will lead to wrong output for accidental character combinations "\u", "\U". This fix prevents this by outputting the backslash itself as backslash-encoded.
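
The failure mode and the fix described above can be reproduced with Python's standard raw_unicode_escape codec alone. The sketch below is not PyMuPDF's code. It assumes an extractor that writes ASCII characters unchanged and every other character as a \uXXXX or \UXXXXXXXX escape, and then decodes the result with raw_unicode_escape. The helper names encode_rune and roundtrip and the sample string are hypothetical.

```python
def encode_rune(ch: str, escape_backslash: bool) -> str:
    """Encode one character for later raw-unicode-escape decoding.

    Hypothetical helper: ASCII passes through unchanged, everything else
    becomes an escape. With escape_backslash=True the backslash itself is
    written as an escape, which is the idea of the fix.
    """
    code = ord(ch)
    if ch == "\\" and escape_backslash:
        return "\\u005c"
    if code < 128:
        return ch
    if code <= 0xFFFF:
        return f"\\u{code:04x}"
    return f"\\U{code:08x}"


def roundtrip(text: str, escape_backslash: bool) -> str:
    """Encode text rune by rune, then decode it the way the output path would."""
    raw = "".join(encode_rune(c, escape_backslash) for c in text).encode("ascii")
    # errors="replace" keeps the demo from raising on a malformed escape.
    return raw.decode("raw_unicode_escape", errors="replace")


sample = "C:\\users\\Čapek"  # contains a literal backslash followed by "u"

print(repr(roundtrip(sample, escape_backslash=False)))  # text is damaged
print(repr(roundtrip(sample, escape_backslash=True)))   # text survives
```

Without the escaping, the literal "\u" in the source text is taken for the start of an escape sequence, so the decoded string no longer matches the input. With the backslash written as \u005c, the round trip is lossless for this sample.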

julian-smith-artifex-com · 2023-10-13T09:27:47Z

Fixed in 1.23.5.

vorel99 added the enhancement label Oct 10, 2023

vorel99 changed the title ~~persist text format~~ persistent get_text() formatting Oct 10, 2023

JorjMcKie added bug and removed enhancement labels Oct 11, 2023

JorjMcKie added a commit that referenced this issue Oct 11, 2023

Fix #2730

21f5675

JorjMcKie closed this as completed in 39b9932 Oct 11, 2023

JorjMcKie reopened this Oct 11, 2023

JorjMcKie added the Fixed in next release label Oct 11, 2023

julian-smith-artifex-com removed the Fixed in next release label Oct 13, 2023

julian-smith-artifex-com closed this as completed Oct 13, 2023

vorel99 mentioned this issue Oct 31, 2023

wrong encoding for "\Č" character when sort=True #2774

Closed
