
How to recognize a hidden text layer with PyMuPDF #3961

Answered by JorjMcKie
ytcpub asked this question in Q&A

There is no easy way yet. However, you can render a pixmap of the text region (using its bbox) and check which colors occur in that rectangle. If, for example, only one color occurs, you know the text is invisible, because visible text would contribute at least one more color.
This works for any type of overlap, whether by vector graphics or by images.

Here is a demo:

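import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
# search_for returns a list of rectangles, one per hit
rl = page.search_for("two lines probably")
bbox = rl[0]  # first hit; assumes the text was found
# render only the area covered by the text
pix = page.get_pixmap(clip=bbox)
# share (0..1) and pixel value of the most frequent color
percent, color = pix.color_topusage()
print(f"{percent*100}% of the region contains color {tuple(map(int, color))}")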

This prints: 100.0% of the region contains color (249, 199, 49). So you know that the searched text is invisible.
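
If you want to run the same check over every word on a page instead of a single search hit, something along these lines might work. It is only a sketch built on the demo above: the helper names `region_is_single_color` and `find_invisible_words` and the `threshold` parameter are made up for illustration and are not part of PyMuPDF. It assumes `page` is a PyMuPDF page and that `get_pixmap` accepts a plain 4-tuple as `clip`.

```python
def region_is_single_color(page, bbox, threshold=1.0):
    """Render `bbox` and report whether one color fills it.

    Returns (is_single_color, ratio, rgb). `threshold` is a hypothetical
    knob: 1.0 means "exactly one color", lower values tolerate some noise.
    """
    pix = page.get_pixmap(clip=bbox)      # render only this rectangle
    ratio, color = pix.color_topusage()   # share and pixel of top color
    return ratio >= threshold, ratio, tuple(color)


def find_invisible_words(page, threshold=1.0):
    """Return (word, bbox, rgb) for each word whose area is one flat color."""
    hidden = []
    # "words" yields tuples: x0, y0, x1, y1, word, block_no, line_no, word_no
    for x0, y0, x1, y1, word, *_ in page.get_text("words"):
        if x1 <= x0 or y1 <= y0:
            continue                      # skip empty rectangles
        bbox = (x0, y0, x1, y1)
        single, _ratio, rgb = region_is_single_color(page, bbox, threshold)
        if single:
            hidden.append((word, bbox, rgb))
    return hidden


# Possible usage (needs PyMuPDF installed and a real file):
#   import pymupdf
#   page = pymupdf.open("test.pdf")[0]
#   for word, bbox, rgb in find_invisible_words(page):
#       print(word, bbox, rgb)
```

Rendering one pixmap per word is slow on dense pages, so checking whole lines or blocks first and only drilling down where needed is probably the better trade-off.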

In one of the next v…

Answer selected by ytcpub