Page.get_text("dict") what's the logic to separate different BLOCKS #4326

anakaft · 2025-02-25T08:56:57Z


Hi, I'm a new programmer. While extracting text from a PDF, I used page.get_text("dict") and drew a rectangle around each block to show which block the text comes from.
How are these blocks separated? I tried to check the source code but could not find it.
I would like to know the logic, for example whether blocks are split at a certain y0 coordinate.
For instance, why is the subtask XXX treated as the same block as paragraph (1)?
Thanks

JorjMcKie · 2025-02-25T09:32:51Z

Maintainer

The hierarchy block -> line -> span is (almost) completely created inside the base library, MuPDF.
The decision about which content becomes part of a block depends on a number of conditions, such as inter-line distances, font sizes, the sequence of occurrence in the page's appearance source code (yes: PDF has a mini-language for this) and several more.
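To make the block -> line -> span hierarchy concrete, here is a minimal sketch that walks the dictionary returned by `page.get_text("dict")`. The helper name `iter_spans` and the tiny hand-made `sample` dictionary are illustrative assumptions, not part of PyMuPDF; a real result carries more keys (font, color, origin, and so on).

```python
def iter_spans(page_dict):
    """Yield (block_index, line_index, span) for every text span.

    Expects the nested layout of page.get_text("dict"):
    page -> "blocks" -> "lines" -> "spans".
    """
    for b_no, block in enumerate(page_dict.get("blocks", [])):
        # Text blocks have type 0; image blocks (type 1) carry no "lines".
        if block.get("type", 0) != 0:
            continue
        for l_no, line in enumerate(block.get("lines", [])):
            for span in line.get("spans", []):
                yield b_no, l_no, span


# Hand-made stand-in for a real extraction result (illustration only).
sample = {
    "blocks": [
        {
            "type": 0,
            "bbox": (50, 100, 300, 130),
            "lines": [
                {"bbox": (50, 100, 300, 114),
                 "spans": [{"text": "Subtask XXX", "size": 12.0}]},
                {"bbox": (50, 116, 300, 130),
                 "spans": [{"text": "(1) first paragraph", "size": 10.0}]},
            ],
        }
    ]
}

for b_no, l_no, span in iter_spans(sample):
    print(b_no, l_no, span["size"], repr(span["text"]))

# With a real document, assuming PyMuPDF is installed and "input.pdf"
# is a placeholder file name:
#   import pymupdf
#   page = pymupdf.open("input.pdf")[0]
#   for b_no, l_no, span in iter_spans(page.get_text("dict")):
#       ...
```

Printing the font size and bounding box next to each span is usually the quickest way to see which of the conditions above kept two pieces of text in one block.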

It is important to note that none of this is reliable by any means. PDF itself knows nothing about things such as lines, words, blocks, tables, text columns, headers or whatever. The only thing that is ultimately known is the position of each character (actually each glyph, which is a very different thing).

The base library itself makes no effort to establish what people like to call "natural" reading sequence. Compare these two example files (file1, file2) to understand what this means: they look identical, but try to extract text from file2. Its characters are stored in an arbitrary permutation. Up to N! physically different but identical-looking PDFs are possible (N = number of characters on the page), but only one of these files will deliver the natural reading sequence with a naive text extraction.

Realistically, with an alphabet containing 26 characters, "only" 26! different files (~ 4.0e+26) are possible of course.
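Because the stored order of text can be arbitrary, a common workaround is to reorder the extracted blocks by their position on the page. The following is a minimal sketch under the assumption of a simple single-column layout; `sort_blocks_top_down` is a hypothetical helper that works on the `page.get_text("dict")` result, and it will not produce a sensible order for multi-column pages.

```python
def sort_blocks_top_down(page_dict):
    """Return the text blocks ordered top-to-bottom, then left-to-right.

    Each block's "bbox" is (x0, y0, x1, y1) with y growing downwards.
    The top edge is rounded so that blocks starting at almost the same
    height are ordered by their left edge instead.
    """
    text_blocks = [
        b for b in page_dict.get("blocks", []) if b.get("type", 0) == 0
    ]
    return sorted(
        text_blocks,
        key=lambda b: (round(b["bbox"][1]), b["bbox"][0]),
    )
```

Recent PyMuPDF versions also accept `sort=True` in `Page.get_text`, which applies a comparable position-based ordering to the returned blocks, so the helper above is mainly useful when a custom sort key is needed.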

A goal of programmer-friendly text extraction therefore is to preprocess raw PDF data before returning them - but at the same time keeping this as an option because of inevitable performance implications.
We are currently investing a lot of effort to offer advanced features that allow detection of text columns, paragraphs and tables already in the C base library, in addition to detection of underlined and strike-out text, visibility etc.


anakaft Feb 25, 2025
Author

I appreciate your time and the quick response.
Since the block hierarchy is created inside MuPDF, it will take me some time to learn the base library. It seems that the decision-making conditions such as inter-line distances, font sizes and sequence of occurrence cannot be modified from PyMuPDF. Could you suggest which file I should start with to modify the logic (for instance, I want to separate the Subtask from paragraph (1))?

The PDF format is like cipher printing. I have to deal with it because the XML source is not accessible. Thanks for your efforts anyway.
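Short of changing MuPDF itself, one option is to post-process a block in Python after extraction. The sketch below regroups the lines of a single block and starts a new group whenever the vertical gap to the previous line is large or the font size changes. The names `split_block`, `gap_factor` and `size_tol` and their default values are assumptions chosen for illustration; they are not PyMuPDF parameters and would need tuning per document.

```python
def split_block(block, gap_factor=0.6, size_tol=0.5):
    """Split the lines of one text block into groups.

    A new group starts when
      * the gap between this line's top and the previous line's bottom
        exceeds gap_factor * font size, or
      * the largest font size in the line differs from the previous
        line's by more than size_tol points.
    Returns a list of groups, each a list of line dictionaries.
    """
    groups = []
    prev_bottom = None
    prev_size = None
    for line in block.get("lines", []):
        spans = line.get("spans", [])
        if not spans:
            continue
        size = max(span["size"] for span in spans)
        top, bottom = line["bbox"][1], line["bbox"][3]
        if prev_bottom is None:
            start_new = True
        else:
            big_gap = (top - prev_bottom) > gap_factor * size
            size_changed = abs(size - prev_size) > size_tol
            start_new = big_gap or size_changed
        if start_new:
            groups.append([])
        groups[-1].append(line)
        prev_bottom, prev_size = bottom, size
    return groups


def line_text(line):
    """Concatenate the span texts of one line."""
    return "".join(span["text"] for span in line.get("spans", []))
```

Applied to a block whose first line is a larger heading such as "Subtask XXX" followed by smaller body lines, this would put the heading into its own group, provided the font sizes differ by more than `size_tol`.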
