Bounding boxes for extracted text #137

simonschoe · 2024-09-13T06:25:52Z

simonschoe
Sep 13, 2024

@JorjMcKie Hi there, is there any chance that it will become possible to obtain bounding boxes for the extracted text elements? That would make it possible to map the extracted text back onto the original PDF page, for example to visualize a chunk. This would be super helpful for end users. :)

Answered by JorjMcKie

jamie-lemon · 2024-09-13T11:55:17Z

jamie-lemon
Sep 13, 2024
Maintainer

I think you can do this with text_blocks = page.get_text("blocks"), see: https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_text
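
As a rough illustration of the suggestion above, here is a minimal sketch that collects the bounding box and text of each text block on a page. The helper name text_block_boxes, the demo function and the file name input.pdf are assumptions made for this example; only page.get_text("blocks") comes from the comment itself.

```python
def text_block_boxes(page):
    """Return (bbox, text) pairs for the text blocks of a PyMuPDF page."""
    results = []
    # Each block tuple is (x0, y0, x1, y1, text, block_no, block_type).
    for x0, y0, x1, y1, text, _block_no, block_type in page.get_text("blocks"):
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        results.append(((x0, y0, x1, y1), text.strip()))
    return results


def demo(path="input.pdf"):  # hypothetical file name
    import pymupdf  # PyMuPDF; older releases are imported as `fitz`

    with pymupdf.open(path) as doc:
        for bbox, text in text_block_boxes(doc[0]):
            print(bbox, repr(text[:40]))
```

Calling demo() on a real file prints one line per text block of the first page. The coordinates are page coordinates, so they refer to the PDF page and not to any Markdown output.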


JorjMcKie · 2024-09-13T12:00:23Z

JorjMcKie
Sep 13, 2024
Maintainer

I fully agree with @jamie-lemon's comment.
Otherwise: this is not an issue, but rather a Discussions item. Let's not bloat the Issues with mere questions!


JorjMcKie · 2024-09-13T12:08:50Z

JorjMcKie
Sep 13, 2024
Maintainer

This package's parent, PyMuPDF, lets you extract every text detail: the position of each single character (in addition to higher aggregates like words, spans, lines and blocks), the text color, font size, font, font attributes, starting points and more.
This package serves a totally different mission in life, namely providing text data to LLMs / RAGs.
Re-importing the above details into the Markdown format would therefore profoundly contradict its purpose.

What is more:
Markdown, as a superset format of HTML, has no notion of a page. Position information, however, is only valid with respect to the page on which something is displayed, so it makes no sense in Markdown.
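
To make the point about PyMuPDF's level of detail concrete, here is a small sketch that walks the "rawdict" output of page.get_text and yields every character with its bounding box and the font properties of its span. The helper name char_details, the demo function and the file name input.pdf are assumptions for this example and are not part of the package discussed here.

```python
def char_details(page):
    """Yield (char, bbox, font, size, color) for each character on a page."""
    for block in page.get_text("rawdict")["blocks"]:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                # Font properties live on the span, positions on each char.
                for ch in span["chars"]:
                    yield (ch["c"], tuple(ch["bbox"]),
                           span["font"], span["size"], span["color"])


def demo(path="input.pdf"):  # hypothetical file name
    import pymupdf  # PyMuPDF; older releases are imported as `fitz`

    with pymupdf.open(path) as doc:
        for char, bbox, font, size, _color in char_details(doc[0]):
            print(repr(char), bbox, font, size)
```

The amount of data this produces for even a single page illustrates why such detail is requested from PyMuPDF directly instead of being embedded in the Markdown output.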


simonschoe · 2024-09-13T13:52:24Z

simonschoe
Sep 13, 2024
Author

@JorjMcKie Thanks for taking the time to respond! ❤️

I understand the purpose of the package and the Markdown approach, but I was initially hoping there would be some way to return the bounding boxes of the text as metadata in the output of to_markdown, because I infer that the positional information of the raw text is used internally to sort the text.

1 reply

JorjMcKie Sep 13, 2024
Maintainer

Yes, of course the package internally makes extensive use of information provided by PyMuPDF.
But we cannot bloat the generated Markdown output with all sorts of data that could just as well be requested from PyMuPDF in the same or a similar way.
We have to keep the size of the output within reasonable limits.

If you look at the helpers subfolder of the package, you will find several scripts that access and massage "native" PyMuPDF data, most prominently functions like get_raw_lines or get_text_lines. Also of interest is the script multi_column, which tries to make sense of the page's layout by looking at its text.

JorjMcKie · 2024-09-13T15:01:33Z

JorjMcKie
Sep 13, 2024
Maintainer

But before you run away in frustration, you could try the parameter extract_words=True. This will force page_chunks=True. In the page dictionary you will then find a list of word tuples (x0, y0, x1, y1, "wordstring", ...), just like you would get from page.get_text("words"), but in the same sequence as in the markdown text.
This is (and probably always will be) the maximum we can provide.
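
A minimal sketch of how this could look in practice. It assumes that the word tuples are stored under the key "words" in each page dictionary (the comment above does not name the key, so the code falls back to an empty list) and that pymupdf4llm.to_markdown accepts a file path. The helper name iter_word_boxes, the demo function and input.pdf are hypothetical.

```python
def iter_word_boxes(chunks):
    """Yield (chunk_index, bbox, word) for every word in the page chunks."""
    # chunk_index is the position in the returned list; it equals the
    # 0-based page number only if all pages were processed in order.
    for chunk_index, chunk in enumerate(chunks):
        # Assumption: word tuples live under "words".
        for word in chunk.get("words", []):
            x0, y0, x1, y1, text = word[:5]
            yield chunk_index, (x0, y0, x1, y1), text


def demo(path="input.pdf"):  # hypothetical file name
    import pymupdf4llm

    # extract_words=True forces page_chunks=True, so a list of
    # per-page dictionaries is returned instead of one string.
    chunks = pymupdf4llm.to_markdown(path, extract_words=True)
    for chunk_index, bbox, text in iter_word_boxes(chunks):
        print(chunk_index, bbox, text)
```

Only the first five items of each tuple are used here, which matches the (x0, y0, x1, y1, "wordstring", ...) layout described above.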

1 reply

simonschoe Sep 13, 2024
Author

Thank you for your answer! That is exactly what I was looking for, as it should allow me to map the bounding box coordinates of words back to chunks in the extracted page text. I really appreciate your help @JorjMcKie, thank you so much! 🤗
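
One possible way to do that mapping, sketched under several assumptions: the word tuples are available under the key "words", one page dictionary is returned per page in document order, and pages are not rotated. The helper phrase_bbox, the demo function and the file names are hypothetical; page.draw_rect and pymupdf.Rect are regular PyMuPDF calls.

```python
def phrase_bbox(words, phrase):
    """Return the union bbox of the first run of words matching `phrase`."""
    target = phrase.split()
    if not target:
        return None
    texts = [w[4] for w in words]
    for start in range(len(texts) - len(target) + 1):
        if texts[start:start + len(target)] == target:
            run = words[start:start + len(target)]
            # Union rectangle: smallest box enclosing all matched words.
            return (min(w[0] for w in run), min(w[1] for w in run),
                    max(w[2] for w in run), max(w[3] for w in run))
    return None


def demo(path="input.pdf", phrase="some phrase"):  # hypothetical inputs
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(path)
    chunks = pymupdf4llm.to_markdown(doc, extract_words=True)
    # Assumption: chunks and pages line up one to one.
    for page, chunk in zip(doc, chunks):
        bbox = phrase_bbox(chunk.get("words", []), phrase)
        if bbox is not None:
            page.draw_rect(pymupdf.Rect(bbox), color=(1, 0, 0), width=1)
    doc.save("highlighted.pdf")  # hypothetical output name
```

A phrase that wraps across lines yields one rectangle covering both lines; drawing one rectangle per word or per line would give a tighter visualization.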
