add_highlight_annot using clip generates "A Number is Out of Range" error in PDF #2322

Closed
TheCapybaraClub opened this issue Apr 6, 2023 · 4 comments

@TheCapybaraClub

Describe the bug (mandatory)

I am trying to use page.add_highlight_annot with the clip option. The highlighting is placed as expected, but opening the resulting PDF raises an "A Number is Out of Range" error. The clip is built from the results of page.get_text("words", textpage=textpage), so I am not sure how my clip could be illegal. If this is not a bug, what am I doing wrong?

To Reproduce (mandatory)

test.pdf

Import Fitz and read PDF

import fitz
pdfDoc = fitz.open('./test.pdf')

Get Text and Do Text Stuff with it (here we find the index of target)

page = pdfDoc[0]
textpage = page.get_textpage(clip=page.mediabox)
page_text_words = page.get_text("words", textpage=textpage)

# xi = list index of key word
for xi, x in enumerate(page_text_words):
    if x[4]=='pellentesque,':
        target_idx = xi
        print(xi, x)

# results seem reasonable
# 88 (242.8954315185547, 157.6929473876953, 308.8439025878906, 172.07373046875, 'pellentesque,', 3, 6, 5)

Get context around this target using the list index

context_span = 10
start_idx = target_idx-context_span
end_idx = target_idx+context_span

context_text = " ".join([x[4] for x in page_text_words[start_idx:(end_idx+1)] ])

# results seem reasonable
# 'et maximus urna. Nullam posuere feugiat orci non ullamcorper. Proin pellentesque, odio id facilisis mollis, sem risus suscipit ex, non aliquet'

Build a clip for this context text

clip_rect = list(page_text_words[start_idx][:4])
for xi, x in enumerate(page_text_words[start_idx:(end_idx+1)]):
    if x[0]<clip_rect[0]:
        clip_rect[0]=x[0]
    if x[1]<clip_rect[1]:
        clip_rect[1]=x[1]
    if x[2]>clip_rect[2]:
        clip_rect[2]=x[2]
    if x[3]>clip_rect[3]:
        clip_rect[3]=x[3]

# results seem reasonable, even though the method is pretty ugly
# [72.02400207519531, 143.41297912597656, 540.11474609375, 186.353759765625]

Use the clip to add a highlight annotation

x0,y0,x1,y1 = clip_rect
rect = fitz.Rect(x0,y0,x1,y1)
highlight = page.add_highlight_annot(quads=None, clip=rect)
highlight.update()

Save PDF

pdfDoc.save("./test_out.pdf", garbage=4, clean=True, deflate=True, deflate_images=True, deflate_fonts=True)
print("Info: Saved Annotated PDF ./test_out.pdf")

# opening this PDF shows the highlighting as expected but also pops up "A Number is Out of Range" error

Expected behavior (optional)

I would expect not to get the "A Number is Out of Range" error

Screenshots (optional)

At first the highlighting doesn't show, only the error. But once you click 'Ok' the error goes away and highlighting shows. Any scrolling brings the error prompt back up.

Error

After clicking 'Ok' and then scrolling

ErrorWithHighlight

Your configuration (mandatory)

3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)] 
 win32 
 
PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.10 on win32 (64-bit).

Additional context (optional)

As always, thank you for the support!

@JorjMcKie
Collaborator

This is not a bug, but incorrect use of the method - actually amazing that something was highlighted at all:
If you set quads to None then start and stop must not be None. Probably a few plausibility checks should be inserted into the method and the documentation be updated.

Likewise, the clip parameter only makes sense, and is only intended to be used, when start and stop are both given (not None).
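The rules described across this thread can be sketched as a plausibility check. This is a minimal illustration only: `check_highlight_args` is a hypothetical helper, not part of PyMuPDF, and it encodes the constraints exactly as the maintainer states them here (either `quads` alone, or `start` and `stop` together with an optional `clip`).

```python
def check_highlight_args(quads=None, start=None, stop=None, clip=None):
    """Hypothetical pre-check for the argument rules stated in this thread.

    Returns "quads" or "start_stop" for a valid combination,
    raises ValueError otherwise.
    """
    if quads is not None:
        # Rule from the thread: with quads, everything else must be None.
        if start is not None or stop is not None or clip is not None:
            raise ValueError("with quads, start/stop/clip must all be None")
        return "quads"
    # Rule from the thread: without quads, start and stop are both required;
    # clip is optional and only meaningful in this mode.
    if start is None or stop is None:
        raise ValueError("without quads, both start and stop are required")
    return "start_stop"


# The call from the original report (quads=None, only clip given) is rejected.
try:
    check_highlight_args(quads=None, clip=(72.0, 143.4, 540.1, 186.4))
except ValueError as exc:
    print("rejected:", exc)
```

Running such a check before calling `page.add_highlight_annot` would have flagged the reported call before any annotation was written.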

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) Apr 6, 2023
JorjMcKie self-assigned this Apr 6, 2023
@JorjMcKie
Collaborator

As an aside: because of quads=None, the coordinates of MuPDF's infinite rectangle are inserted; these are the numbers the PDF viewer does not like.
You should simply have used your rectangle as the highlight rectangle.
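A minimal sketch of that suggestion follows. It computes the enclosing rectangle of a run of word tuples and shows, in a comment, how it would be passed as the first (`quads`) argument. The helper name `union_bbox` and the sample words are made up for illustration; only the tuple layout of `page.get_text("words")` is taken from the report above.

```python
def union_bbox(words):
    """Return (x0, y0, x1, y1) enclosing all given word tuples.

    Each word is expected in the layout shown in the report:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    if not words:
        raise ValueError("need at least one word")
    return (
        min(w[0] for w in words),
        min(w[1] for w in words),
        max(w[2] for w in words),
        max(w[3] for w in words),
    )


# Made-up sample words (not extracted from test.pdf).
sample = [
    (72.0, 143.4, 110.0, 157.8, "et", 3, 4, 0),
    (242.9, 157.7, 308.8, 172.1, "pellentesque,", 3, 6, 5),
    (500.0, 172.0, 540.1, 186.4, "aliquet", 3, 7, 9),
]
bbox = union_bbox(sample)
print(bbox)

# With PyMuPDF available, the rectangle is passed as the highlight area itself,
# leaving start, stop and clip unset:
#   page.add_highlight_annot(fitz.Rect(bbox))
```

Note that this produces one highlight block covering the whole rectangle, not a line-by-line text marking.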

@TheCapybaraClub
Author

I do not understand what the proper use of the method should be. As you say, if quads is set to None then start and stop must not be None, so I tried the following calls, but got the same issue.

highlight = page.add_highlight_annot(quads=None, clip=rect, start=start_context_clip, stop=end_context_clip)
highlight = page.add_highlight_annot(quads=None, start=start_context_clip, stop=end_context_clip)

In your follow-up, you said I should simply have used my rectangle as the highlight rectangle. Do you mean I should call the method as shown below, with quads set to rect? I agree this avoids the error, but it does not accomplish the same goal of "to highlight consecutive lines between the points start and stop"; it just throws one highlight across all rows.

highlight = page.add_highlight_annot(quads=rect, start=start_context_clip, stop=end_context_clip)

I would like to understand how to duplicate the example shown in the second note within the documentation
https://pymupdf.readthedocs.io/en/latest/page.html#Page.add_highlight_annot

@JorjMcKie
Collaborator

This is correct: page.add_highlight_annot(quads=None, start=start_context_clip, stop=end_context_clip). Using clip (in this scenario only) is also correct.

If using quads, all remaining parameters must be None.

Both ways are mutually exclusive.

Currently however, the method internally may generate infinite rectangles / quads which are the reason for the PDF viewer's complaint. This is fixed in the next release.
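For the start/stop form confirmed above, the two points can be derived from the word list. The sketch below assumes the words are in reading order, that `start` should be the top-left corner of the first word, and that `stop` should be the bottom-right corner of the last word. `start_stop_points` is a hypothetical helper name and the sample words are made up.

```python
def start_stop_points(words):
    """Return (start, stop) points for a run of words in reading order.

    Assumption: start is the top-left of the first word and stop is the
    bottom-right of the last word. Word tuples use the layout
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    if not words:
        raise ValueError("need at least one word")
    first, last = words[0], words[-1]
    start = (first[0], first[1])  # top-left corner of the first word
    stop = (last[2], last[3])     # bottom-right corner of the last word
    return start, stop


# Made-up sample spanning three lines (not extracted from test.pdf).
context_words = [
    (300.0, 143.4, 320.0, 157.8, "et", 3, 4, 7),
    (242.9, 157.7, 308.8, 172.1, "pellentesque,", 3, 6, 5),
    (72.0, 172.0, 110.0, 186.4, "aliquet", 3, 7, 0),
]
start, stop = start_stop_points(context_words)
print(start, stop)

# With PyMuPDF available, the points are passed with quads left at None:
#   page.add_highlight_annot(start=start, stop=stop)
```

Unlike a single enclosing rectangle, this form is meant to mark the text line by line between the two points, which is the behavior asked for earlier in the thread.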
