How to find images not included by page.get_text('rawdict')? #4370

lo5 · 2025-03-12T23:43:14Z

I have some scanned PDFs where each page's content is contained in an image, but the image doesn't get reported as a type=1 image block when calling page.get_text('rawdict'). This usually happens when the image exceeds the page bounds by a few pixels, possibly due to bad scanner / doc-management software. It looks like this is by design, based on the notes here and here.

The images missed by page.get_text() do get reported by page.get_image_info().

However, the dicts returned by the two calls differ, so a direct comparison may not be reliable (there is no shared hash, for example):

Image details from page.get_text('rawdict'):

Image details from page.get_image_info():

Question: Is there a reliable way to check if the images reported by page.get_image_info() were already included by page.get_text()? Or, is there a better way to get images not reported by page.get_text('rawdict')?
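
One possible way to do the comparison, as a rough sketch only: match entries on the fields that both dict types appear to share. This assumes that type=1 blocks and `get_image_info()` entries both carry `bbox`, `width` and `height` keys (worth verifying against your PyMuPDF version). The helper names `image_key`, `images_missing_from_text` and `missing_images_on_page` are hypothetical, and the match is a heuristic rather than a true identity check.

```python
def image_key(d, ndigits=1):
    """Heuristic fingerprint for an image dict: rounded bbox plus pixel size."""
    bbox = tuple(round(v, ndigits) for v in d["bbox"])
    return (bbox, d.get("width"), d.get("height"))


def images_missing_from_text(blocks, image_infos):
    """Return the entries of image_infos that have no matching type=1 block."""
    seen = {image_key(b) for b in blocks if b.get("type") == 1}
    return [info for info in image_infos if image_key(info) not in seen]


def missing_images_on_page(page):
    """Assumes `page` is a pymupdf.Page; uses the default clipping of get_text()."""
    blocks = page.get_text("rawdict")["blocks"]
    return images_missing_from_text(blocks, page.get_image_info())
```

Two different images placed at the same position with the same pixel size would be treated as one by this key, so it is only a first approximation.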

lo5 · 2025-03-13T01:11:30Z

I missed this part in the get_text() docs:

To avoid clipping altogether use clip=pymupdf.INFINITE_RECT(). Only then the extraction will contain all items.

Passing this option, and manually clipping the block bounding box to the page bounding box solved my problem.
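
For anyone landing here later, a minimal sketch of that fix follows. It assumes `page` is a `pymupdf.Page` and that image blocks expose `type` and `bbox` keys; the helper names `clip_bbox` and `image_blocks_unclipped` are made up for illustration and are not part of PyMuPDF.

```python
def clip_bbox(bbox, page_bbox):
    """Intersect an (x0, y0, x1, y1) box with the page box."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_bbox
    return (max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1))


def image_blocks_unclipped(page):
    """Collect all image blocks, including those that overhang the page."""
    import pymupdf  # imported here so clip_bbox stays usable without PyMuPDF

    # Disable clipping so images extending past the page bounds are reported.
    blocks = page.get_text("rawdict", clip=pymupdf.INFINITE_RECT())["blocks"]
    page_bbox = tuple(page.rect)
    images = []
    for block in blocks:
        if block["type"] != 1:  # keep image blocks only
            continue
        # Copy the block, replacing its bbox with the part inside the page.
        images.append({**block, "bbox": clip_bbox(block["bbox"], page_bbox)})
    return images
```

An image lying entirely outside the page would come out with an inverted box (x1 < x0 or y1 < y0), so a caller may want to drop those entries.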
