Page.searchFor returns a separate hit rectangle for each marked content item #575

Closed
hujunyao opened this issue Jul 31, 2020 · 7 comments

@hujunyao

Describe the bug (mandatory)

I use searchFor to search for text in a PDF, and for certain search strings it returns repeated results. A sample PDF is attached.

To Reproduce (mandatory)

Sample code:

import fitz  # PyMuPDF

pdfpath = "4.pdf"  # the attached sample file (path assumed)
file_handle = fitz.open(pdfpath)
page = file_handle.loadPage(0)
# key_word = "申请人:"      # correct: one result
# key_word = "(羲和指马)"   # correct: one result
key_word = "申请人:("       # returns repeated results
co_list = page.searchFor(key_word, hit_max=16, quads=False, flags=1)

Expected behavior (optional)

searchFor should return ONE result.
4.pdf

Your configuration (mandatory)

  • Operating system, win10
  • Python version, 3.7.4
  • PyMuPDF 1.16.9, installation method (pip3 install xxxxx ).
@JorjMcKie
Collaborator

Interesting example! If you draw the multiple rectangles returned when searching for "申请人:(" and give them different colors, you will see this:
[image: the returned rectangles drawn in different colors]
I have no idea what is happening here, but it is definitely caused upstream, in MuPDF.

@JorjMcKie JorjMcKie added upstream bug bug outside this package and removed bug labels Jul 31, 2020
@JorjMcKie
Collaborator

Hi @hujunyao - I have some intermediate results / explanations:
The separate rectangles correspond to separate items of a "marked content sequence" (your PDF contains marked content).
MuPDF subdivides the text it encounters along the rectangles of the different marked content items. The same thing happens when the search text is spread across more than one line: if it continues on the next line, you also get two rectangles.

The surprise here is that the same happens for different marked content items, even when they are on the same line.
But I suspect this is not a bug. On the contrary: it works as designed!

I understand that you would prefer not to be bothered by such technical details. So far I have not found any place where this behaviour can be controlled ...

For now, I am removing the bug label and labelling this issue as "question".
Please let me know what you think.

@JorjMcKie JorjMcKie added question and removed upstream bug bug outside this package labels Aug 1, 2020
@JorjMcKie
Collaborator

JorjMcKie commented Aug 1, 2020

There is no (obvious) way to prevent the above from happening.
Also, in every text extraction format the text appears as expected, meaning it is not subdivided by marked content items.
The following solutions come to mind:

  • Add a parameter to the search methods which requests joining any adjacent or overlapping rectangles on the same line (identical y1 value). This would solve your case. But in extreme situations, things like ABCABC might be joined incorrectly when searching for ABC ... and I definitely do not want more logic which checks the text in that rectangle.
  • Do not use searchFor at all, but instead do your own search in the output of one of the getText() variants.

@JorjMcKie
Collaborator

I have researched a little more:
As I wrote, pieces of the search text that belong to different marked content items are handled like pieces that occur on more than one line: separate rectangles are returned for these pieces, and none of the rectangles contains the complete search string.

It is possible to detect such a situation by checking the following criteria:

  • the bottom rectangle coordinates (y1) are equal
  • the rectangles overlap or are at least adjacent: rect_a.x1 >= rect_b.x0, where rect_a is the rectangle further to the left.
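
The two criteria can be sketched in a few lines of plain Python. This is only an illustration, assuming the hits are available as (x0, y0, x1, y1) tuples; the function name join_same_line_rects and the tolerance tol are made up for this example and are not part of PyMuPDF.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def join_same_line_rects(rects: List[Rect], tol: float = 1e-3) -> List[Rect]:
    """Join rectangles that share a bottom edge and overlap or touch horizontally.

    tol is an assumed tolerance for comparing float coordinates.
    """
    joined: List[Rect] = []
    # Sort by bottom edge, then left edge, so join candidates become neighbours.
    for x0, y0, x1, y1 in sorted(rects, key=lambda r: (r[3], r[0])):
        if joined:
            px0, py0, px1, py1 = joined[-1]
            same_line = abs(py1 - y1) <= tol   # criterion 1: equal y1
            touching = px1 + tol >= x0         # criterion 2: rect_a.x1 >= rect_b.x0
            if same_line and touching:
                # Replace the previous rectangle by the union of both.
                joined[-1] = (px0, min(py0, y0), max(px1, x1), max(py1, y1))
                continue
        joined.append((x0, y0, x1, y1))
    return joined


hits = [
    (110.0, 74.0, 137.0, 85.0),  # second piece of the hit, same line
    (90.0, 74.0, 110.0, 85.0),   # first piece of the hit
    (90.0, 100.0, 120.0, 111.0), # a separate hit on another line
]
print(join_same_line_rects(hits))
```

With PyMuPDF, the input could probably be built as [tuple(r) for r in page.searchFor(needle)], since Rect objects behave like sequences of four floats. Note that the sketch shares the ABCABC caveat: two genuinely separate, directly adjacent hits on one line would also be joined.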

With this intermediate (and admittedly not fully satisfying) result I suggest closing the issue.

@hujunyao
Author

Thanks, JorjMcKie, for the detailed reply. I think adding a parameter may be a solution: a parameter that merges overlapping rectangles with the same y1, so that the results are combined.
The sample PDF was exported from WinWord without any special processing.
We use getText to search the text, but then cannot get the position (x, y) of the text.
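
One possible way to search extracted text and still get a position is to work on character-level output, where every character carries its own bounding box. The sketch below assumes a structure like the one returned by page.getText("rawdict") (blocks, lines, spans, and chars with "c" and "bbox" keys). The helper names chars_from_rawdict and find_text_bbox are hypothetical, and the sample data is synthetic.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Char = Tuple[str, BBox]                   # (character, its bounding box)


def chars_from_rawdict(page_dict: dict) -> List[Char]:
    """Flatten a rawdict-style structure into (char, bbox) pairs in stored order."""
    chars: List[Char] = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                for ch in span["chars"]:
                    chars.append((ch["c"], tuple(ch["bbox"])))
    return chars


def find_text_bbox(chars: List[Char], needle: str) -> Optional[BBox]:
    """Return the union of the character boxes of the first match, or None."""
    if not needle:
        return None
    text = "".join(c for c, _ in chars)  # one entry per character, so indexes line up
    start = text.find(needle)
    if start < 0:
        return None
    boxes = [bbox for _, bbox in chars[start:start + len(needle)]]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Synthetic data: one line made of two spans, "AB" and "C(".
sample = {"blocks": [{"lines": [{"spans": [
    {"chars": [{"c": "A", "bbox": (0, 0, 10, 12)}, {"c": "B", "bbox": (10, 0, 20, 12)}]},
    {"chars": [{"c": "C", "bbox": (20, 0, 30, 12)}, {"c": "(", "bbox": (30, 0, 36, 12)}]},
]}]}]}

print(find_text_bbox(chars_from_rawdict(sample), "BC("))
```

Because the search runs over the concatenated characters, the match is not split at span or marked content boundaries. A match that continues on the next line would yield one large box covering both lines, so a real implementation would need to split the result per line.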

@JorjMcKie JorjMcKie changed the title from "searchfor return Repeated results。" to "Page.searchFor returns a separate hit rectangle for each marked content item" Aug 19, 2020
@JorjMcKie
Collaborator

The next version, 1.18.2, will finally resolve your issue: it joins overlapping rectangles if they are on the same line (this will not work if the quads=True option is used).
A solution has become possible now because I was also able to remove the hit_max parameter from the searchFor() method: the number of returned hits is no longer limited.

>>> doc=fitz.open("4.pdf")
>>> page = doc[0]
>>> needle = "申请人:("
>>> page.searchFor(needle)
[Rect(90.02400207519531, 74.36576080322266, 137.4199981689453, 85.36920166015625)]
>>> # only one rect returned

@JorjMcKie
Collaborator

Version 1.18.2 is being uploaded to PyPI right now.
