Page.get_text("dict"): what's the logic to separate different BLOCKS? #4326
It is important to note that the hierarchy (blocks, lines, spans) is by no means reliable. PDF itself knows nothing about lines, words, blocks, tables, text columns, headers or the like. The only thing that is ultimately known is the position of each character (actually each glyph, which is a very different thing). The base library itself makes no effort to establish what people like to call the "natural" reading sequence. Compare these two example files (file1, file2) to understand what this means: they look identical, but try to extract text from file2:
A goal of programmer-friendly text extraction is therefore to preprocess raw PDF data before returning it, while keeping this preprocessing optional because of the inevitable performance implications.
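If a fixed rule such as "start a new block after a certain vertical distance" is what you want, you can apply it yourself to the extracted dictionary. The following is only a minimal sketch of such a caller-side heuristic, not the logic the base library uses internally. The helper names `iter_lines` and `regroup_by_vertical_gap`, the `max_gap` threshold and the hand-written `sample` dictionary are assumptions made for illustration; only the layout of blocks, lines and spans with their `bbox` and `text` keys follows what `page.get_text("dict")` returns.

```python
def iter_lines(page_dict):
    """Yield (bbox, text) for every line of every text block."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            yield tuple(line["bbox"]), text


def regroup_by_vertical_gap(page_dict, max_gap=5.0):
    """Group lines top to bottom; open a new group when the vertical
    distance to the previous group exceeds max_gap (hypothetical rule)."""
    # bbox is (x0, y0, x1, y1); sort by top edge, then by left edge
    lines = sorted(iter_lines(page_dict), key=lambda item: (item[0][1], item[0][0]))
    groups = []
    prev_bottom = None
    for bbox, text in lines:
        y0, y1 = bbox[1], bbox[3]
        if prev_bottom is None or y0 - prev_bottom > max_gap:
            groups.append([text])  # gap is large enough: start a new group
            prev_bottom = y1
        else:
            groups[-1].append(text)  # close enough: same group
            prev_bottom = max(prev_bottom, y1)
    return groups


# Hand-made stand-in for page.get_text("dict"): a single block that
# contains a heading and a two-line paragraph (made-up coordinates).
sample = {
    "blocks": [
        {
            "type": 0,
            "bbox": (50, 100, 300, 165),
            "lines": [
                {"bbox": (50, 100, 200, 112),
                 "spans": [{"text": "Subtask XXX"}]},
                {"bbox": (50, 140, 300, 152),
                 "spans": [{"text": "(1) first paragraph line"}]},
                {"bbox": (50, 153, 300, 165),
                 "spans": [{"text": "continues here"}]},
            ],
        }
    ]
}

groups = regroup_by_vertical_gap(sample, max_gap=5.0)
for number, texts in enumerate(groups):
    print(number, texts)
```

With a real document you would pass the result of `page.get_text("dict")` instead of `sample`. A suitable `max_gap` depends on font size and line spacing, and a purely vertical rule like this will merge text from side-by-side columns, so treat it as a starting point only.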
Hi, I'm a new programmer. While extracting text from a PDF, I used page.get_text("dict") and drew a rect around each block to show which block the text comes from.

How are those blocks separated? I tried to check the source code but could not find it.
I want to know the logic, for example whether they are separated at a certain y0 coordinate.
For example, why is the subtask XXX regarded as the same block as paragraph (1)?
Thanks