`Page.searchFor` returns a separate hit rectangle for each marked content item #575

hujunyao · 2020-07-31T02:13:05Z

Describe the bug (mandatory)

I use `searchFor` to search for text in a PDF and get repeated results for specific search strings. A sample PDF is attached.

To Reproduce (mandatory)

the sample code:

import fitz  # PyMuPDF

pdfpath = "4.pdf"  # assumption: path to the attached sample file
file_handle = fitz.open(pdfpath)
page = file_handle.loadPage(0)
# key_word = "申请人："      # "Applicant:" - correct result
# key_word = "（羲和指马）"  # correct result
key_word = "申请人：（"      # returns repeated results
co_list = page.searchFor(key_word, hit_max=16, quads=False, flags=1)

Expected behavior (optional)

`searchFor` should return ONE result.
4.pdf

Your configuration (mandatory)

Operating system, win10
Python version, 3.7.4
PyMuPDF 1.16.9, installation method (pip3 install xxxxx ).

JorjMcKie · 2020-07-31T08:49:52Z

Interesting example! If you draw the returned multiple rectangles when searching for "申请人：（" and give them different colors, you will see this:

I have no idea what is happening here, but it is definitely caused upstream, in MuPDF.

JorjMcKie · 2020-08-01T12:03:03Z

Hi @hujunyao - I have some intermediate results / explanations:
The separate rectangles correspond to separate items of a "marked content sequence" (your PDF contains marked content).
So MuPDF subdivides the text it finds along the rectangles of the different marked content items. This is analogous to search text that continues on a subsequent line: in that case you also get two rectangles.

The surprise here is that the same thing happens for different marked content items, even if they are on the same line.
But I suspect this is not a bug. On the contrary: it works as designed!

I understand that you would prefer not to be bothered by such technical sophistication. I have not yet found any place where this behaviour can be controlled ...

For now, I am removing the "bug" label and labelling this issue as "question".
Please let me know your reaction.

JorjMcKie · 2020-08-01T13:08:48Z

There is no (obvious) way to prevent the above from happening.
Also, in every text extraction format the text appears as expected, meaning it is not subdivided according to marked content items.
The following solutions come to mind:

1. Add a parameter to the search methods that requests joining any adjacent or overlapping rectangles on the same line (identical y1 value). This would solve your case. But in extreme situations, things like ABCABC might be joined incorrectly when searching for ABC ... and I definitely do not want more logic which checks the text in that rectangle.
2. Do not use searchFor at all, but instead do your own search in the output of one of the getText() variants.

JorjMcKie · 2020-08-06T17:19:16Z

I have researched a little more:
As I wrote, pieces of the search text that belong to different marked content items are handled like pieces that occur on more than one line: separate rectangles are returned for these pieces, and none of the rectangles contains the complete search string.

It is possible to detect such a situation by checking the following criteria:

1. The bottom rectangle coordinates (y1) are equal.
2. The rectangles overlap or are at least adjacent: rect_a.x1 >= rect_b.x0, where rect_a is the left rectangle and rect_b the one to its right.
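
These two criteria can be sketched as a small post-processing step over a search result. This is only an illustration under assumptions: rectangles are plain `(x0, y0, x1, y1)` tuples rather than `fitz.Rect` objects, `join_hits` and its `ndigits` rounding are hypothetical choices, and nothing here claims to show how the library handles it internally.

```python
def join_hits(rects, ndigits=3):
    """Join rectangles that share a bottom edge (y1) and overlap or touch.

    rects: iterable of (x0, y0, x1, y1) tuples.
    ndigits: y1 values are compared after rounding to this many decimals.
    """
    joined = []
    # Sort by line (rounded y1), then left to right within a line.
    for r in sorted(rects, key=lambda r: (round(r[3], ndigits), r[0])):
        if joined:
            p = joined[-1]
            same_line = round(p[3], ndigits) == round(r[3], ndigits)
            touching = p[2] >= r[0]  # previous right edge reaches this left edge
            if same_line and touching:
                joined[-1] = (p[0], min(p[1], r[1]), max(p[2], r[2]), max(p[3], r[3]))
                continue
        joined.append(tuple(r))
    return joined


# Two pieces of one hit on the same line (made-up coordinates).
pieces = [(120.0, 74.4, 137.4, 85.4), (90.0, 74.4, 120.0, 85.4)]
merged = join_hits(pieces)

# A hit that wraps onto the next line keeps its two rectangles.
wrapped = join_hits([(90.0, 74.4, 200.0, 85.4), (50.0, 86.4, 80.0, 97.4)])

# Two genuine, directly adjacent hits are also joined into one.
adjacent = join_hits([(0.0, 0.0, 30.0, 10.0), (30.0, 0.0, 60.0, 10.0)])
```

The last call shows the ABCABC caveat mentioned earlier: geometry alone cannot tell two pieces of one hit apart from two complete hits that happen to touch.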

With this intermediate (and admittedly not fully satisfying) result, I suggest closing the issue.

hujunyao · 2020-08-12T02:56:32Z

Thanks to JorjMcKie for the detailed reply. I think adding a parameter may be a solution: a parameter that merges overlapping rectangles with the same y1, so that the results are combined.
The sample PDF was exported from WinWord without any special processing.
We tried getText and searched the text, but could not get the position (x, y) of the text.

JorjMcKie · 2020-10-27T09:01:52Z

The next version, 1.18.2, will finally resolve your issue and join overlapping rectangles if they are on the same line (this will not work if the quads=True option is used).
A solution has become possible now because I was also able to remove the hit_max parameter from the searchFor() method: the number of returned hits is no longer limited.

>>> import fitz
>>> doc = fitz.open("4.pdf")
>>> page = doc[0]
>>> needle = "申请人：（"
>>> page.searchFor(needle)
[Rect(90.02400207519531, 74.36576080322266, 137.4199981689453, 85.36920166015625)]
>>> # only one rect returned

JorjMcKie · 2020-10-27T12:02:35Z

Version 1.18.2 is being uploaded to PyPI right now.

hujunyao added the bug label Jul 31, 2020

hujunyao assigned JorjMcKie Jul 31, 2020

JorjMcKie added the "upstream bug" label (bug outside this package) and removed the "bug" label Jul 31, 2020

JorjMcKie added the "question" label and removed the "upstream bug" label Aug 1, 2020

JorjMcKie changed the title ~~searchfor return Repeated results。~~ Page.searchFor returns a separate hit rectangle for each marked content item Aug 19, 2020

JorjMcKie closed this as completed Oct 27, 2020
