Bounding boxes for extracted text #137
-
@JorjMcKie Hi there, any chance that it will be possible in the future to obtain bounding boxes for the extracted text elements? That way it would be possible to map the extracted text back onto the original PDF page, for example, to visualize the chunk. This would be super helpful for end users. :)
Replies: 5 comments 2 replies
-
I think you can do this with |
-
I fully agree with @jamie-lemon's comment.
-
This package's parent, PyMuPDF, lets you extract all text detail, so you can get each single character's position (in addition to higher aggregates like words, spans, lines, blocks), the text color, font size, font, font attributes, starting points and whatnot. What is more: |
-
@JorjMcKie Thanks for taking the time to respond! ❤️ I understand the purpose of the package and the markdown approach, but initially I was hoping that there would be some way to return bounding boxes of text as |
-
But before you run away in frustration, you could try parameter `extract_words=True`. This will enforce `page_chunks=True`. In the page dictionary you will then find a list of word tuples `(x0, y0, x1, y1, "wordstring", ...)`. Just like you would get it using `page.get_text("words")` ... but in the sequence as in the markdown text.

This is (and probably ever will be) the maximum we can provide.
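
For anyone landing here later, a minimal sketch of how those word tuples might be turned into one rectangle per chunk for visualization. The helper names `union_bbox` and `chunk_words` are made up for illustration, and the sketch assumes each page dictionary exposes the word list under the key `"words"`, so check that against the version you have installed.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def union_bbox(words: Iterable[Sequence]) -> BBox:
    """Return the smallest rectangle covering all given word tuples.

    Each word is expected to look like (x0, y0, x1, y1, "wordstring", ...),
    i.e. the shape described in the reply above.
    """
    words = list(words)
    if not words:
        raise ValueError("no words given")
    return (
        min(w[0] for w in words),  # leftmost x0
        min(w[1] for w in words),  # topmost y0
        max(w[2] for w in words),  # rightmost x1
        max(w[3] for w in words),  # bottommost y1
    )


def chunk_words(pdf_path: str) -> Iterator[Tuple[int, List[Sequence]]]:
    """Yield (chunk_index, word_tuples) for every page chunk of a PDF.

    Hypothetical wrapper: pymupdf4llm is imported lazily so union_bbox
    stays usable without it. Assumes the page dict has a "words" key.
    """
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(pdf_path, extract_words=True)
    for index, chunk in enumerate(chunks):
        yield index, chunk["words"]


# Hand-written sample in the word-tuple shape, not real extraction output.
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "Hello", 0, 0, 0),
    (115.0, 100.0, 160.0, 112.0, "world", 0, 0, 1),
    (72.0, 114.0, 95.0, 126.0, "again", 0, 1, 0),
]
sample_box = union_bbox(sample_words)
```

The resulting rectangle is in page coordinates, so it could be drawn back onto the original page with PyMuPDF, for example via `page.draw_rect(...)`. To highlight a sub-page chunk, pass only the words that belong to it, since the word list follows the sequence of the markdown text.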