How to find images not included by page.get_text('rawdict')? #4370
I missed this part in the
Passing this option, and manually clipping the block bounding box to the page bounding box, solved my problem.
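For anyone landing here later, the clipping step can be sketched roughly as below. This is only an illustration, not necessarily the exact code used: `clip_bbox` is a hypothetical helper name, and both arguments are assumed to be plain `(x0, y0, x1, y1)` tuples, such as a block's `bbox` and the values of `page.rect`.

```python
def clip_bbox(bbox, page_rect):
    """Intersect a block bbox with the page rectangle.

    Both arguments are (x0, y0, x1, y1) tuples in the same coordinate space.
    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    # Pull each edge back inside the page if it sticks out.
    return (max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1))


# Made-up numbers: an image that overshoots a Letter-size page by a few points.
clipped = clip_bbox((-2.0, -1.5, 614.0, 793.0), (0.0, 0.0, 612.0, 792.0))
print(clipped)
```

If you are working with PyMuPDF `Rect` objects rather than tuples, the rectangle intersection operator (`pymupdf.Rect(bbox) & page.rect`) should give the same result.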
I have some scanned PDFs where each page's content is contained in an image, but the image doesn't get reported as a `type=1` image block when calling `page.get_text('rawdict')`. This usually happens when the image exceeds the page bounds by a few pixels, possibly due to bad scanner / doc-management software. It looks like this is by design, based on the notes here and here.

The images missed by `page.get_text()` do get reported by `page.get_image_info()`. But the dicts from both calls differ, so a direct comparison may not be reliable (no hash, etc.):
Image details from `page.get_text('rawdict')`:

Image details from `page.get_image_info()`:

Question: Is there a reliable way to check if the images reported by `page.get_image_info()` were already included by `page.get_text()`? Or, is there a better way to get images not reported by `page.get_text('rawdict')`?
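To make the question concrete, here is a minimal sketch of the kind of comparison I have in mind. It assumes that both the `type=1` blocks and the `page.get_image_info()` entries carry `bbox`, `width` and `height` keys, and that matching on those with a small tolerance is good enough; `bboxes_close`, `unreported_images` and the tolerance value are hypothetical names and choices of mine, not PyMuPDF API, and the sample dicts are made up.

```python
def bboxes_close(a, b, tol=1.0):
    """True if two (x0, y0, x1, y1) boxes agree within `tol` on every edge."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def unreported_images(image_infos, blocks, tol=1.0):
    """Return the image-info dicts that have no matching type=1 block.

    image_infos: list of dicts, as from page.get_image_info()
    blocks:      list of dicts, as from page.get_text('rawdict')['blocks']
    """
    image_blocks = [b for b in blocks if b.get("type") == 1]
    missing = []
    for info in image_infos:
        matched = any(
            bboxes_close(info["bbox"], blk["bbox"], tol)
            and info.get("width") == blk.get("width")
            and info.get("height") == blk.get("height")
            for blk in image_blocks
        )
        if not matched:
            missing.append(info)
    return missing


# Made-up sample data, shaped like the two results but heavily trimmed.
infos = [
    {"number": 0, "bbox": (50.0, 50.0, 150.0, 150.0), "width": 200, "height": 200},
    {"number": 1, "bbox": (-2.0, -1.5, 614.0, 793.0), "width": 2550, "height": 3300},
]
blocks = [
    {"type": 0, "bbox": (72.0, 72.0, 300.0, 90.0)},
    {"type": 1, "bbox": (50.0, 50.0, 150.0, 150.0), "width": 200, "height": 200},
]
missing = unreported_images(infos, blocks)
print([m["number"] for m in missing])
```

On a real page the inputs would be `page.get_image_info()` and `page.get_text('rawdict')['blocks']`. This breaks down if the same image is placed twice at the same position, which is why I am asking whether there is a more reliable identifier.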