Annot.get_text("words") - doesn't return the first line of words #2270

Closed
dsclee1 opened this issue Feb 27, 2023 · 13 comments

Comments

@dsclee1

dsclee1 commented Feb 27, 2023


Describe the bug (mandatory)

After getting the "TEXT" page annotations, I can't get the words from the first line of text in the textbox using the .get_text("words") method. It looks as if the bounding rect of the textbox starts a couple of pixels too far down the page and therefore misses the first line of the textbox. I could have made a mistake, so I could do with some guidance.

To Reproduce (mandatory)

import fitz  # PyMuPDF

with fitz.open("multiline textbox hell.pdf") as document:
    for page_number, page in enumerate(document):
        # Word tuples for the whole page: (x0, y0, x1, y1, word, block_no, line_no, word_no)
        words = page.get_text("words")
        textBoxes = []
        for textBox in page.annots(types=(fitz.PDF_ANNOT_FREE_TEXT, fitz.PDF_ANNOT_TEXT)):
            textBoxes.append({"rect": textBox.rect, "info": textBox.info, "words": textBox.get_text("words")})

# words and textBoxes hold the values from the last page processed.
# words[15] is expected to be 'cover', the first word in the textbox.
print("First word in textbox: 'cover' - x0: " + str(words[15][0]) + " y0: " + str(words[15][1]))
print("Textbox location: - x0: " + str(textBoxes[0]['rect'].x0) + " y0: " + str(textBoxes[0]['rect'].y0))

Your configuration (mandatory)

Additional context (optional)

I'm trying to get the content of textboxes that have been placed over the top of existing text, and not read text behind the textbox. My method at the moment is to find the bounding rect of the textbox and remove any text from behind the rect shape. Any guidance on if that is the right approach, or if there's a better method would be great?
multiline textbox hell.pdf

@dsclee1
Author

dsclee1 commented Feb 27, 2023

The output from the example code:

First word in textbox: 'cover' -x0: 71.802001953125 y0: 83.93902587890625
Textbox location: - x0: 71.80169677734375 y0: 86.79302978515625

The y0 coords don't look right to me? Shouldn't the word "cover" be contained within the textbox?

@JorjMcKie
Collaborator

> The output from the example code:
>
> First word in textbox: 'cover' -x0: 71.802001953125 y0: 83.93902587890625
> Textbox location: - x0: 71.80169677734375 y0: 86.79302978515625
>
> The y0 coords don't look right to me? Shouldn't the word "cover" be contained within the textbox?

As this is a FreeText annot, the coordinates don't count here anyway. The text returned for this annotation type is the value of the dictionary entry Annot.info["content"].
In PyMuPDF, however, I simply call the same MuPDF functions as for normal (page) text ... which should work.
Presumably things go wrong here because there is no newline between "cover" and "up", only a carriage return \r instead. On current operating systems, the two common newline conventions are "CRLF" (Windows) and "LF" (otherwise). CR alone really goes back to the beginning of the line ... and thus overwrites "cover".
But if you simply take Annot.info["content"] you will get the complete answer.
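
To illustrate the line-ending point, here is a minimal sketch that splits annotation content into lines regardless of whether CRLF, bare CR or LF is used. The helper name `split_annot_lines` and the sample string are assumptions made for illustration; they are not part of PyMuPDF, and the real content of the attached file may differ.

```python
def split_annot_lines(content: str) -> list[str]:
    """Split annotation text into lines, treating CRLF, bare CR and LF alike."""
    # Normalize CRLF first so it is not counted as two line breaks.
    normalized = content.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.split("\n")


# Hypothetical value of Annot.info["content"] containing a bare carriage return.
sample = "cover it\rup"
lines = split_annot_lines(sample)
print(lines)
```

Normalizing explicitly, rather than relying on the viewer that wrote the file, keeps the result the same whichever convention the content uses.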

@dsclee1
Author

dsclee1 commented Feb 27, 2023

Thanks for the quick reply. It makes sense to take the value from Annot.info["content"]. Unfortunately that still causes issues for me when textboxes are stacked on top of each other: customers will sometimes amend addresses by repeatedly adding textboxes over other textboxes, for example to change house numbers and streets, and I need to take the text from only the top textbox. I've attached an example with a textbox stacked over another textbox (multiline textbox hell 2.pdf). As Annot.info["content"] doesn't contain the coordinates for each word, I can't calculate which words are masked by the next textbox stacked on top.

Basically I'm seeking to only get visible text. Is there a simple way to do that?

Interesting that only carriage returns seem to be present. The textboxes were added using the latest version of Acrobat Reader (2022.003.20322) on a Windows 11 machine. Would have thought it should correctly put in CRLFs?

@JorjMcKie
Collaborator

A PDF stores the annotations of each page in an array, in the order in which they were added: the last annotation in that array is the most recent one. The same sequence is returned if you iterate over page.annots() or over the xref numbers pertaining to annotations: page.annot_xrefs().
So you should be able to identify every annotation and tell them apart.
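
As a rough sketch of how that ordering could be used, the snippet below records, for each annotation, which later annotations overlap it. It works on plain `(x0, y0, x1, y1)` tuples rather than PyMuPDF objects; the function names and the sample rectangles are hypothetical, and it assumes that a later annotation is the one the user sees on top.

```python
Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def covered_by_later(rects: list[Rect]) -> dict[int, list[int]]:
    """Map each annotation index to the indices of later annotations overlapping it."""
    return {
        i: [j for j in range(i + 1, len(rects)) if intersects(rects[i], rects[j])]
        for i in range(len(rects))
    }


# Hypothetical rectangles, listed oldest first (the order described above).
rects = [(70, 80, 200, 120), (70, 80, 120, 100), (300, 300, 350, 320)]
overlaps = covered_by_later(rects)
print(overlaps)
```

With real data, the tuples could be built from each annotation's rect while iterating over page.annots().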

@JorjMcKie
Collaborator

> Interesting that only carriage returns seem to be present. The textboxes were added using the latest version of Acrobat Reader (2022.003.20322) on a Windows 11 machine. Would have thought it should correctly put in CRLFs?

There is a general lack of specification detail when it comes to annotations. So viewers differ in functionality (support of border dashing options, border line variants, etc.).

@dsclee1
Author

dsclee1 commented Feb 27, 2023

> A PDF stores the annotations of each page in an array, in the order in which they were added: the last annotation in that array is the most recent one. The same sequence is returned if you iterate over page.annots() or over the xref numbers pertaining to annotations: page.annot_xrefs(). So you should be able to identify every annotation and tell them apart.

Thanks, that's very helpful. So I can always get the latest amendment as the last item in the array.

Still not sure how I can get around situations where the latest amendment only covers a small section of the previous amendment, though. I will still need to use some text from the previous amendment and only mask the words that are replaced by the current amendment. But I don't have word coords?

@JorjMcKie
Collaborator

Assuming for the moment that get_text() does work, you can use all the text extraction variants, including "dict" or "words", which both deliver coordinates.

@JorjMcKie
Collaborator

If that fails in the same way, as a last resort you can try converting the PDF to a PDF where all annotations have been converted to standard text:

import fitz

# doc is the original document, already opened with fitz.open()
pdfbytes = doc.convert_to_pdf()
doc2 = fitz.open("pdf", pdfbytes)  # an intermediate, converted version
page = doc2[0]  # take the page from the converted document, not from doc
words = page.get_text("words")  # the normal full-page list of words

Here you will find all annot text as the last items, with full position info per word.

@JorjMcKie
Collaborator

Sorry, I actually have to correct myself:
That list of words is also extractable in the original PDF, and in the same sequence:

In [4]: page.get_text("words")
Out[4]:
[(72.0, 74.15966796875, 96.0569839477539, 85.20011138916016, 'Multi', 0, 0, 0),
 (98.4481430053711,
  74.15966796875,
  114.80936431884766,
  85.20011138916016,
  'line',
  0,
  0,
  1),
 (72.0,
  87.59966278076172,
  110.04363250732422,
  98.64010620117188,
  'Example',
  0,
  1,
  0),
 (112.53959655761719,
  87.59966278076172,
  140.95494079589844,
  98.64010620117188,
  'where',
  0,
  1,
  1),
 (72.0,
  101.03965759277344,
  74.78167724609375,
  112.0801010131836,
  'I',
  0,
  2,
  0),
 (77.27763366699219,
  101.03965759277344,
  98.87897491455078,
  112.0801010131836,
  'have',
  0,
  2,
  1),
 (101.3078842163086,
  101.03965759277344,
  137.33685302734375,
  112.0801010131836,
  'covered',
  0,
  2,
  2),
 (139.77330017089844,
  101.03965759277344,
  151.31488037109375,
  112.0801010131836,
  'up',
  0,
  2,
  3),
 (72.0,
  114.47966003417969,
  94.72760009765625,
  125.52010345458984,
  'Lines',
  0,
  3,
  0),
 (97.25460815429688,
  114.47966003417969,
  117.17918395996094,
  125.52010345458984,
  'with',
  0,
  3,
  1),
 (119.61563110351562,
  114.47966003417969,
  163.45700073242188,
  125.52010345458984,
  'textboxes',
  0,
  3,
  2),
 (72.0,
  127.79966735839844,
  80.58222961425781,
  138.84011840820312,
  'In',
  0,
  4,
  0),
 (83.0186767578125,
  127.79966735839844,
  88.30709838867188,
  138.84011840820312,
  'a',
  0,
  4,
  1),
 (90.80306243896484,
  127.79966735839844,
  107.71954345703125,
  138.84011840820312,
  'silly',
  0,
  4,
  2),
 (110.11933135986328,
  127.79966735839844,
  128.35418701171875,
  138.84011840820312,
  'way',
  0,
  4,
  3),
 (71.802001953125,
  83.93902587890625,
  96.25199890136719,
  97.67902374267578,
  'cover',
  1,
  0,
  0),
 (99.03300476074219,
  83.93902587890625,
  104.03300476074219,
  97.67902374267578,
  'it',
  1,
  0,
  1),
 (71.802001953125,
  95.93902587890625,
  82.9219970703125,
  109.67902374267578,
  'up',
  1,
  1,
  0)]
In [5]:
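
Each tuple in that list has the layout (x0, y0, x1, y1, word, block_no, line_no, word_no). As a minimal sketch of how the list can be regrouped into text lines, the snippet below keys the words by (block_no, line_no). The function name `lines_from_words` is made up for this example. In the output above the annotation words happen to carry block number 1, which is an observation about this file and not something to rely on in general.

```python
from collections import defaultdict


def lines_from_words(words):
    """Group "words" tuples into text lines keyed by (block_no, line_no)."""
    grouped = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        grouped[(block_no, line_no)].append((word_no, text))
    # Sort the words of each line by word number, then join them with spaces.
    return {
        key: " ".join(text for _, text in sorted(items))
        for key, items in sorted(grouped.items())
    }


# A few tuples copied from the output above.
words = [
    (72.0, 74.15966796875, 96.0569839477539, 85.20011138916016, "Multi", 0, 0, 0),
    (98.4481430053711, 74.15966796875, 114.80936431884766, 85.20011138916016, "line", 0, 0, 1),
    (71.802001953125, 83.93902587890625, 96.25199890136719, 97.67902374267578, "cover", 1, 0, 0),
    (99.03300476074219, 83.93902587890625, 104.03300476074219, 97.67902374267578, "it", 1, 0, 1),
    (71.802001953125, 95.93902587890625, 82.9219970703125, 109.67902374267578, "up", 1, 1, 0),
]
lines = lines_from_words(words)
print(lines)
```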

@dsclee1
Author

dsclee1 commented Feb 27, 2023

Thanks. I've got a section in my code that pulls the words list like that. I think I can work something together using the rects of the textboxes and the full words array. Need the textbox rects in case customers are trying to mask out words with blank textboxes.

I'm happy to close this.

@dsclee1 dsclee1 closed this as completed Feb 27, 2023
@dsclee1 dsclee1 reopened this Feb 28, 2023
@dsclee1
Author

dsclee1 commented Feb 28, 2023

Sorry to reopen this, but I'm still having issues. Reading the documentation around Annots I should be able to use the "get_text()" and "get_textbox(rect)" methods.
https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_text
https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_textbox

I've created a new example. This time with a completely blank doc and only one textbox annotation.
blank with 1 annotation.pdf

import fitz
with fitz.open("blank with 1 annotation.pdf") as document:
    for page_number, page in enumerate(document):
        for textBox in page.annots(types=(fitz.PDF_ANNOT_FREE_TEXT,fitz.PDF_ANNOT_TEXT)):
            print("textBox.type :", textBox.type)
            print("textBox.get_text('words') : ", textBox.get_text('words'))
            print("textBox.get_text('text') : ", textBox.get_text('text'))
            print("textBox.get_textbox() : ", textBox.get_textbox(textBox.rect))
            print("textBox.info['content'] : ", textBox.info['content'])

Result:

textBox.type : (2, 'FreeText')
textBox.get_text('words') :  []
textBox.get_text('text') :
textBox.get_textbox() :
textBox.info['content'] :  abc123

None of the get_text functions seem to be working for my annotation.
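
Until the extraction methods return text for such annotations, one possible workaround is to fall back to the content entry, which does hold the text in the result above. The sketch below is only an illustration: `annot_text` is a hypothetical helper, and `FakeAnnot` is a stand-in that mimics the behaviour shown in the result rather than a real PyMuPDF object. The fallback gives no coordinates, so it does not help with masking.

```python
def annot_text(annot) -> str:
    """Return the annotation's text, falling back to info["content"] if extraction is empty."""
    text = annot.get_text("text").strip()
    if text:
        return text
    return annot.info.get("content", "")


class FakeAnnot:
    """Stand-in for an annotation that behaves like the result shown above."""

    info = {"content": "abc123"}

    def get_text(self, option="text"):
        # Mimics the empty extraction result reported in this comment.
        return ""


recovered = annot_text(FakeAnnot())
print(recovered)
```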

@julian-smith-artifex-com
Collaborator

With our latest tree (not yet pushed to github), the output for your new example is:

textBox.type : (2, 'FreeText')
textBox.get_text('words') :  [(3.184000015258789, 1.1180419921875, 35.984004974365234, 14.858041763305664, 'abc123', 0, 0, 0)]
textBox.get_text('text') :  abc123

textBox.get_textbox() :  abc123
textBox.info['content'] :  abc123

Is this the expected output?

@JorjMcKie may be able to explain why our latest tree is behaving differently from the current release 1.21.1.

We'll be pushing the tree to github in the next few days, and making a new release soon.

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 13, 2023
@dsclee1
Author

dsclee1 commented Mar 13, 2023

> With our latest tree (not yet pushed to github), the output for your new example is:
>
> textBox.type : (2, 'FreeText')
> textBox.get_text('words') :  [(3.184000015258789, 1.1180419921875, 35.984004974365234, 14.858041763305664, 'abc123', 0, 0, 0)]
> textBox.get_text('text') :  abc123
>
> textBox.get_textbox() :  abc123
> textBox.info['content'] :  abc123
>
> Is this the expected output?
>
> @JorjMcKie may be able to explain why our latest tree is behaving differently from the current release 1.21.1.
>
> We'll be pushing the tree to github in the next few days, and making a new release soon.

That's what I'm looking for! Perfect. I'll close when the new tree arrives.

julian-smith-artifex-com added a commit that referenced this issue Mar 14, 2023
This test passes with current tree.