Is it possible to extract highlighted text? #318
You can extract the text (and images) from pages via |
Thanks for the quick reply @JorjMcKie ! Just to make sure I understand correctly, when you say "It would be your task to relate any annotations or links to those data: they are not be contained in that dict." Do you mean that I will need to manually search for any text snippet I want to extract? Or is there a function within pymupdf which can extract the coordinates of the highlighted areas/texts and then I can use these coordinates to extract the relevant snippets from the dictionary containing all the extracted text from a page? |
Not quite.
The basic PDF page's text itself can only have properties inherited from the font in use, like italic, bold, monospaced, serifed. It cannot directly be highlighted or underlined. Any such effects come from outside the actual text specification (for which annots are just an example). |
Thanks for taking the time to elaborate on your explanation. This makes sense to me. I have been able to extract the highlight rectangles coordinates using |
Bah, this is treason ... ;-) You don't need another package for locating annotations. Why not use PyMuPDF for this? |
Scan through a page's annotations like this to find underline / highlight / strikeout / squiggly underlined annotations:
annot = page.firstAnnot
while annot:
    if annot.type[0] in (8, 9, 10, 11):  # one of the 4 types above
        rect = annot.rect  # this is the rectangle the annot covers
        # extract the text within that rect ...
    annot = annot.next  # None returned after last annot |
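As a rough illustration of the loop above, the sketch below walks the same `firstAnnot` / `next` chain and collects the rectangle of every text-markup annotation. The function name `collect_markup_rects`, the name table `MARKUP_TYPES`, and the stand-in objects are hypothetical; the stand-ins only mimic the three attributes the loop reads, so the snippet runs without a PDF. The code-to-name mapping is an assumption based on the usual PyMuPDF type codes, and newer PyMuPDF releases may spell the attribute `first_annot`.

```python
from types import SimpleNamespace

# Assumed meaning of the four type codes used in the reply above.
MARKUP_TYPES = {8: "Highlight", 9: "Underline", 10: "Squiggly", 11: "StrikeOut"}


def collect_markup_rects(page):
    """Return (type_name, rect) pairs for the text-markup annotations on a page.

    `page` only needs a `firstAnnot` attribute; each annotation needs
    `type`, `rect` and `next`, as in the loop quoted above.
    """
    found = []
    annot = page.firstAnnot
    while annot:
        code = annot.type[0]
        if code in MARKUP_TYPES:
            found.append((MARKUP_TYPES[code], annot.rect))
        annot = annot.next  # None after the last annotation
    return found


# Stand-in objects (not real PyMuPDF objects): one highlight followed by
# one sticky note, which the filter should skip.
note = SimpleNamespace(type=(0, "Text"), rect=(10, 10, 30, 30), next=None)
mark = SimpleNamespace(type=(8, "Highlight"), rect=(50, 100, 200, 115), next=note)
fake_page = SimpleNamespace(firstAnnot=mark)

print(collect_markup_rects(fake_page))
```

With a real document, `fake_page` would be replaced by a page object such as `doc[0]`.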
Ah that's great thanks! I saw that the rectangle area in the wiki example was constructed by searching for the actual words, which is why I wasn't sure if there was a way to find the highlighted area coordinates/rectangle using PyMuPDF. I will start off from your code, thanks again! |
Hi, @JorjMcKie. Thanks for the wonderful package. Some enhancement may be required, because many words outside the highlighted area are extracted as well. Here is my code.
import fitz
from itertools import groupby

def print_hightlight_text(page, rect):
    """Print the text contained in the given rectangular highlighted area.

    Args:
        page (fitz.Page): the associated page.
        rect (fitz.Rect): rectangular highlighted area.
    """
    words = page.getText("words")  # list of words on page
    words.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    # keep every word whose rectangle overlaps the highlight rectangle
    mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
    group = groupby(mywords, key=lambda w: w[3])  # one group per text line
    for y1, gwords in group:
        print(" ".join(w[4] for w in gwords))

def main():
    doc = fitz.open('./PDF/sample-3.pdf')
    page = doc[0]
    annot = page.firstAnnot
    print_hightlight_text(page, annot.rect)

if __name__ == "__main__":
    main()
The output of this file will be:
Even when the fully contained words are extracted with the first method in this wiki post, the words before and after the highlight will be selected anyway:
Any hint as to this problem? My program will parse lots of highlights, so the introduction of that noise will have a huge impact. |
Your next question will probably be: That should also be possible:
>>> len(annot.vertices) # always a multiple of 4!
8
>>> from pprint import pprint
>>> pprint(annot.vertices)
[(304.84161376953125, 274.20062255859375),
(388.37060546875, 274.20062255859375),
(304.84161376953125, 289.2906188964844),
(388.37060546875, 289.2906188964844),
(51.02360153198242, 286.20062255859375),
(182.28359985351562, 286.20062255859375),
(51.02360153198242, 301.2906188964844),
(182.28359985351562, 301.2906188964844)]
>>>
So if you do
>>> for word in wlist:
        if fitz.Rect(word[:4]) in rect1:
            print(word[4])
optimal
operation
of
>>> for word in wlist:
        if fitz.Rect(word[:4]) in rect2:
            print(word[4])
combined
heat
and
power
plants
>>>
This should be what you wanted ... 😎 |
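The session above uses `rect1` and `rect2` without showing how they were built. One plausible way, sketched here in plain Python so it runs without PyMuPDF, is to group the vertices four at a time and take the bounding box of each group. The helper names `quads_to_rects` and `contains` and the sample word tuples are hypothetical; only the vertex values come from the session above.

```python
def quads_to_rects(vertices):
    """Group vertices four at a time and return each group's bounding box."""
    rects = []
    usable = len(vertices) - len(vertices) % 4  # ignore an incomplete tail
    for i in range(0, usable, 4):
        quad = vertices[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def contains(outer, inner):
    """True if rectangle `inner` lies completely inside rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


# Vertex values copied from the session above: two quads, one per text line.
vertices = [
    (304.84161376953125, 274.20062255859375),
    (388.37060546875, 274.20062255859375),
    (304.84161376953125, 289.2906188964844),
    (388.37060546875, 289.2906188964844),
    (51.02360153198242, 286.20062255859375),
    (182.28359985351562, 286.20062255859375),
    (51.02360153198242, 301.2906188964844),
    (182.28359985351562, 301.2906188964844),
]
rect1, rect2 = quads_to_rects(vertices)

# Made-up word tuples in the (x0, y0, x1, y1, text) layout of the word list.
words = [
    (280.0, 276.0, 303.0, 288.0, "the"),       # ends left of the first quad
    (310.0, 276.0, 345.0, 288.0, "optimal"),   # inside the first quad
    (60.0, 288.0, 110.0, 300.0, "combined"),   # inside the second quad
]
line1 = [w[4] for w in words if contains(rect1, w[:4])]
line2 = [w[4] for w in words if contains(rect2, w[:4])]
print(line1, line2)
```

Note that the union of the two boxes runs from x ≈ 51 to x ≈ 388 across both lines, which is presumably why testing words against the single `annot.rect` also picks up words before and after the highlight.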
Thanks for the reply. The method works. I did try Will write more code to make the program robust, so that it knows when to include the intersections. |
Here is a more compact code snippet:
points = annot.vertices
quad_count = len(points) // 4
highlight_words = []
for i in range(quad_count):
    r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
    for w in wordlist:
        if fitz.Rect(w[:4]) in r:
            highlight_words.append(w[4])
print(" ".join(highlight_words))
Delivers exactly the highlighted text. |
I enhanced your method by selecting words according to intersection areas. The word is only selected when the highlight contains at least 90% of that word.
_threshold_intersection = 0.9  # if the intersection is large enough.

def _check_contain(r_word, points):
    """If `r_word` is contained in the rectangular area.

    The area of the intersection should be large enough compared to the
    area of the given word.

    Args:
        r_word (fitz.Rect): rectangular area of a single word.
        points (list): list of points in the rectangular area of the
            given part of a highlight.

    Returns:
        bool: whether `r_word` is contained in the rectangular area.
    """
    # `r` is mutable, so a new `r` has to be created every time.
    r = fitz.Quad(points).rect
    r.intersect(r_word)
    return r.getArea() >= r_word.getArea() * _threshold_intersection

def _extract_annot(annot, words_on_page):
    """Extract words in a given highlight.

    Args:
        annot (fitz.Annot): the highlight annotation.
        words_on_page (list): word tuples of the page, as returned by
            page.getText("words").

    Returns:
        str: words in the entire highlight.
    """
    quad_points = annot.vertices
    quad_count = len(quad_points) // 4
    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        points = quad_points[i * 4: i * 4 + 4]
        words = [
            w for w in words_on_page if
            _check_contain(fitz.Rect(w[:4]), points)
        ]
        sentences[i] = ' '.join(w[4] for w in words)
    sentence = ' '.join(sentences)
    return sentence |
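To make the 90% rule concrete, here is a small plain-Python sketch of the same idea: compute what fraction of a word's rectangle is covered by one highlight rectangle, then compare it with the threshold. The function name `overlap_ratio` and the sample rectangles are hypothetical; the real code above works on `fitz.Rect` objects instead of tuples.

```python
THRESHOLD = 0.9  # same value as `_threshold_intersection` above


def overlap_ratio(word, area):
    """Fraction of the word rectangle that lies inside `area`.

    Both arguments are (x0, y0, x1, y1) tuples.
    """
    width = min(word[2], area[2]) - max(word[0], area[0])
    height = min(word[3], area[3]) - max(word[1], area[1])
    if width <= 0 or height <= 0:
        return 0.0  # the rectangles do not overlap at all
    word_area = (word[2] - word[0]) * (word[3] - word[1])
    return (width * height) / word_area if word_area > 0 else 0.0


highlight = (100.0, 50.0, 200.0, 62.0)         # made-up highlight line
inside = (110.0, 49.5, 150.0, 62.5)            # sticks out slightly above and below
neighbour = (190.0, 50.0, 230.0, 62.0)         # only its left quarter is covered
far_away = (300.0, 50.0, 340.0, 62.0)          # no overlap

for name, rect in [("inside", inside), ("neighbour", neighbour), ("far_away", far_away)]:
    ratio = overlap_ratio(rect, highlight)
    print(name, round(ratio, 3), ratio >= THRESHOLD)
```

A strict containment test would reject `inside` because it is a fraction taller than the highlight, while a plain intersection test would accept `neighbour`; the ratio threshold sits between the two.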
Excellent! |
Yes, you always foresee the next problem 😅. I just fixed some errors in another part of my package that were caused by underline annotations. I debugged it by checking the annotation type first, before passing the annotation to these two functions. |
Hey @edxu96 @JorjMcKie, this thread was really helpful for one of my ongoing projects. It works well for extracting text from a paragraph, but it seems to fail when I run the same code on a table. I used the enhanced method by @edxu96 and called the function. Could you please suggest what's wrong here? Here is the piece of script which calls the above function:
Here is the link to the document I used. I have randomly highlighted a few rows and their values. EDIT: turns out setting |
@shadowwarrior29 Glad to hear that it works. Do you still need help? By the way, I couldn't find any highlight in your document. Probably because I am using a Mac. |
@edxu96 the script is working as expected, so no help needed at the moment :) Sorry, I should have mentioned that the document in the link is not highlighted; I highlighted it manually after downloading, for testing purposes. Here is the highlighted doc I've been using to test the script for different highlighted text scenarios. |
@edxu96 I keep getting this error. |
@dummifiedme - unfortunately the function _check_contain cannot be inspected, but I guess the sequence of |
The
EDIT: Found the error, it was on my part. I didn't pass the "words" argument in the |
Can you paste the entire code? I am trying to work on a similar problem and it would be helpful.
import fitz
_threshold_intersection = 0.1  # if the intersection is large enough.
def _check_contain(r_word, points):
def _extract_annot(annot, words_on_page):
def main():
if __name__ == "__main__":
@JorjMcKie @edxu96 |
@Saumya-09 I haven't checked your code and haven't worked on PDF for a while, but this is where I used those functions in zequnyu/cmdict. |
I noticed this helpful wiki post on how to select text underlying a rectangle. Is it also possible to select highlighted text that is not in the shape of a rectangle? Preferably together with any comments/annotations made on those text regions.