Annot.get_text("words") - doesn't return the first line of words #2270
Comments
The output from the example code:
The y0 coords don't look right to me? Shouldn't the word "cover" be contained within the textbox? |
As this is a FreeText annot, the coordinates don't count here anyway. The text returned for this annotation type is the content of this dictionary value |
Thanks for the quick reply. It makes sense to take the value from Basically I'm seeking to only get visible text. Is there a simple way to do that? Interesting that only carriage returns seem to be present. The textboxes were added using the latest version of Acrobat Reader (2022.003.20322) on a Windows 11 machine. I would have thought it would put in CRLFs. |
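As a minimal sketch of dealing with the carriage-return-only line breaks mentioned above: Python's built-in `str.splitlines()` treats `\r`, `\n` and `\r\n` alike, so the annotation content can be split into lines without knowing which convention the viewer used. The helper name `freetext_lines` and the sample string are made up for illustration; the string is only modelled on the "cover it" / "up" textbox from this issue.

```python
def freetext_lines(content):
    """Split annotation content into lines, whatever line ending was used."""
    # str.splitlines() recognises "\r", "\n" and "\r\n" as line boundaries.
    return content.splitlines()


# Hypothetical content string with a bare carriage return as the line break.
sample_content = "cover it\rup"

lines = freetext_lines(sample_content)
normalized = "\n".join(lines)  # same text with "\n" line endings
print(lines)
```

The same call works unchanged if a viewer writes `\r\n` or `\n` instead.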
A PDF stores annotations for each page in an array in the sequence LILO: the last annotation in that array is the most recent one added. This sequence is also returned if you iterate over |
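A rough sketch of using that ordering, assuming the page object is a PyMuPDF page whose `annots()` method yields annotations in array order, and that each annotation exposes `type`, `rect` and `info["content"]`. The function name `freetext_annots_in_order` is hypothetical, and the function never imports PyMuPDF itself, so it works on any object with that shape.

```python
def freetext_annots_in_order(page):
    """Return (rect, content) pairs for FreeText annotations, oldest first."""
    result = []
    for annot in page.annots():  # assumed to follow the page's annotation array
        type_name = annot.type[1]  # annot.type is a pair such as (2, 'FreeText')
        if type_name == "FreeText":
            result.append((tuple(annot.rect), annot.info["content"]))
    return result


def latest_freetext(page):
    """Return the most recently added FreeText annotation, or None."""
    found = freetext_annots_in_order(page)
    return found[-1] if found else None
```

With a real document this would be called as `latest_freetext(doc[0])`. Whether the last entry really is the newest depends on the viewer appending to the array as described above.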
There is a general lack of specification detail when it comes to annotations. So viewers differ in functionality (support of border dashing options and border line variants, etc.). |
Thanks, that's very helpful. So I can always get the latest amendment as the last item in the array. I'm still not sure how to handle situations where the latest amendment only covers a small section of the previous amendment, though. I will still need to use some text from the previous amendment, and only mask the words that are replaced by the current amendment. But I don't have word coords? |
Taken for granted at the moment, that |
If that fails in the same way, as a last resort you can try converting the PDF to one where all annotations have been converted to standard text:
pdfbytes = doc.convert_to_pdf()
doc2 = fitz.open("pdf", pdfbytes)  # an intermediate, converted version
page = doc2[0]  # first page of the converted document
words = page.get_text("words")  # the normal full-page list of words
Here you will find all annot text as the last items, with full position info per word. |
Sorry, I actually have to correct myself:
In [4]: page.get_text("words")
Out[4]:
[(72.0, 74.15966796875, 96.0569839477539, 85.20011138916016, 'Multi', 0, 0, 0),
(98.4481430053711,
74.15966796875,
114.80936431884766,
85.20011138916016,
'line',
0,
0,
1),
(72.0,
87.59966278076172,
110.04363250732422,
98.64010620117188,
'Example',
0,
1,
0),
(112.53959655761719,
87.59966278076172,
140.95494079589844,
98.64010620117188,
'where',
0,
1,
1),
(72.0,
101.03965759277344,
74.78167724609375,
112.0801010131836,
'I',
0,
2,
0),
(77.27763366699219,
101.03965759277344,
98.87897491455078,
112.0801010131836,
'have',
0,
2,
1),
(101.3078842163086,
101.03965759277344,
137.33685302734375,
112.0801010131836,
'covered',
0,
2,
2),
(139.77330017089844,
101.03965759277344,
151.31488037109375,
112.0801010131836,
'up',
0,
2,
3),
(72.0,
114.47966003417969,
94.72760009765625,
125.52010345458984,
'Lines',
0,
3,
0),
(97.25460815429688,
114.47966003417969,
117.17918395996094,
125.52010345458984,
'with',
0,
3,
1),
(119.61563110351562,
114.47966003417969,
163.45700073242188,
125.52010345458984,
'textboxes',
0,
3,
2),
(72.0,
127.79966735839844,
80.58222961425781,
138.84011840820312,
'In',
0,
4,
0),
(83.0186767578125,
127.79966735839844,
88.30709838867188,
138.84011840820312,
'a',
0,
4,
1),
(90.80306243896484,
127.79966735839844,
107.71954345703125,
138.84011840820312,
'silly',
0,
4,
2),
(110.11933135986328,
127.79966735839844,
128.35418701171875,
138.84011840820312,
'way',
0,
4,
3),
(71.802001953125,
83.93902587890625,
96.25199890136719,
97.67902374267578,
'cover',
1,
0,
0),
(99.03300476074219,
83.93902587890625,
104.03300476074219,
97.67902374267578,
'it',
1,
0,
1),
(71.802001953125,
95.93902587890625,
82.9219970703125,
109.67902374267578,
'up',
1,
1,
0)]
In [5]: |
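To make the listing above easier to read: each item returned by `get_text("words")` is a tuple `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below groups such tuples back into lines of text. The helper name `words_to_lines` is made up, and the sample data is a handful of tuples from the output above with coordinates rounded to two decimals.

```python
from collections import defaultdict


def words_to_lines(words):
    """Join word tuples into text lines keyed by (block_no, line_no)."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text))
    # Sort lines by key and words by word_no so reading order is preserved.
    return {
        key: " ".join(text for _, text in sorted(items))
        for key, items in sorted(lines.items())
    }


sample = [
    (72.0, 74.16, 96.06, 85.20, "Multi", 0, 0, 0),
    (98.45, 74.16, 114.81, 85.20, "line", 0, 0, 1),
    (71.80, 83.94, 96.25, 97.68, "cover", 1, 0, 0),
    (99.03, 83.94, 104.03, 97.68, "it", 1, 0, 1),
    (71.80, 95.94, 82.92, 109.68, "up", 1, 1, 0),
]

print(words_to_lines(sample))
# {(0, 0): 'Multi line', (1, 0): 'cover it', (1, 1): 'up'}
```

In this output the former annotation text arrives as block 1, after the page's own text in block 0.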
Thanks. I've got a section in my code that pulls the words list like that. I think I can work something together using the rects of the textboxes and the full words array. Need the textbox rects in case customers are trying to mask out words with blank textboxes. I'm happy to close this. |
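The masking idea described here could be sketched roughly as follows, in plain Python over word tuples and textbox rectangles given as `(x0, y0, x1, y1)`. The names `center_inside` and `visible_words` and the textbox rectangle are hypothetical; the word tuples are taken from the listing earlier in this thread, rounded to two decimals. A word counts as hidden when its centre point lies inside any textbox rect, which also handles blank textboxes used purely as masks.

```python
def center_inside(word, rect):
    """True if the centre of the word's bounding box lies inside rect."""
    cx = (word[0] + word[2]) / 2
    cy = (word[1] + word[3]) / 2
    return rect[0] <= cx <= rect[2] and rect[1] <= cy <= rect[3]


def visible_words(page_words, textbox_rects):
    """Keep only the page words that are not covered by any textbox rect."""
    return [
        w for w in page_words
        if not any(center_inside(w, r) for r in textbox_rects)
    ]


page_words = [
    (72.0, 74.16, 96.06, 85.20, "Multi", 0, 0, 0),
    (72.0, 87.60, 110.04, 98.64, "Example", 0, 1, 0),
    (101.31, 101.04, 137.34, 112.08, "covered", 0, 2, 2),
    (72.0, 114.48, 94.73, 125.52, "Lines", 0, 3, 0),
]
textbox_rects = [(70.0, 86.0, 160.0, 112.0)]  # hypothetical textbox rectangle

print([w[4] for w in visible_words(page_words, textbox_rects)])
# ['Multi', 'Lines']
```

Testing the centre point rather than any overlap is a design choice: a plain intersection test would also drop words on neighbouring lines whose boxes touch the textbox by a point or two.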
Sorry to reopen this, but I'm still having issues. Reading the documentation around Annots I should be able to use the "get_text()" and "get_textbox(rect)" methods. I've created a new example. This time with a completely blank doc and only one textbox annotation.
Result:
None of the get_text functions seem to be working for my annotation. |
With our latest tree (not yet pushed to github), the output for your new example is:
Is this the expected output? @JorjMcKie may be able to explain why our latest tree is behaving differently from the current release 1.21.1. We'll be pushing the tree to github in the next few days, and making a new release soon. |
That's what I'm looking for! Perfect. I'll close when the new tree arrives. |
This test passes with current tree.
Describe the bug (mandatory)
After getting "TEXT" page annotations, I can't get the words from the first line of text in the textbox using the .get_text("words") method. It looks as if the bounding rect for the textbox is being read a couple of pixels too far down the page, so the first line of the textbox is missed. I could have made a mistake, but I could do with some guidance.
To Reproduce (mandatory)
Your configuration (mandatory)
Additional context (optional)
I'm trying to get the content of textboxes that have been placed over the top of existing text, without reading the text behind the textbox. My method at the moment is to find the bounding rect of the textbox and remove any text that lies behind that rect. Any guidance on whether that is the right approach, or whether there's a better method, would be great.
multiline textbox hell.pdf