Is it possible to extract highlighted text? #318

Closed
joelostblom opened this issue Jul 4, 2019 · 24 comments

Comments

@joelostblom

I noticed this helpful wiki post on how to select the text underlying a rectangle. Is it also possible to select highlighted text that is not in the shape of a rectangle, preferably together with any comments/annotations made on those text regions?

@JorjMcKie
Collaborator

You can extract the text (and images) from pages via page.getText("dict"). This also works for non-PDF documents.
The result is a dictionary explained here. Except for text colors, this dictionary could be used to reconstruct a full document page in its original look, including images.
It would be your task to relate any annotations or links to those data: they are not contained in that dict.

@joelostblom
Author

Thanks for the quick reply @JorjMcKie !

Just to make sure I understand correctly, when you say "It would be your task to relate any annotations or links to those data: they are not be contained in that dict." Do you mean that I will need to manually search for any text snippet I want to extract? Or is there a function within pymupdf which can extract the coordinates of the highlighted areas/texts and then I can use these coordinates to extract the relevant snippets from the dictionary containing all the extracted text from a page?

@JorjMcKie
Collaborator

Not quite.

  • The PDF concept of annotations represents a way to add "comments" or remarks with a reduced permission level. They don't count as "edits".
  • The "normal" page text is not affected by annots - annots are like dust which can be wiped off again, if you follow this metaphor.
  • Some annot types support highlighting and underlining text.

The basic PDF page's text itself can only have properties inherited from the font in use, like italic, bold, monospaced, serifed.

It cannot directly be highlighted or underlined. Any such effects come from outside the actual text specification (for which annots are just an example).
So it is your task to e.g. take the rectangle from a highlighting annotation, then dig your way through the page extraction dictionary and look for text pieces (so-called "spans") which are located within that rectangle.
This is what I was alluding to in the previous post.
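
As a rough illustration of that "dig through the dictionary" step, here is a minimal sketch in plain Python. It assumes the blocks / lines / spans nesting with "bbox" and "text" keys that the PyMuPDF documentation describes for page.getText("dict"). The sample dictionary, the rectangle values and the helper name spans_in_rect are made up for this example, so nothing here touches a real PDF.

```python
def spans_in_rect(page_dict, rect):
    """Collect the text of all spans whose bbox lies fully inside rect."""
    x0, y0, x1, y1 = rect
    hits = []
    for block in page_dict.get("blocks", []):
        # image blocks carry no "lines" key, so default to an empty list
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                sx0, sy0, sx1, sy1 = span["bbox"]
                if x0 <= sx0 and y0 <= sy0 and sx1 <= x1 and sy1 <= y1:
                    hits.append(span["text"])
    return hits


# Invented sample data that only mimics the shape of the real dictionary.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [
                {"bbox": (50, 100, 200, 112), "text": "inside the highlight"},
                {"bbox": (210, 100, 400, 112), "text": "outside of it"},
            ]},
        ]},
        {"type": 1},  # an image block without text lines
    ]
}

highlight_rect = (40, 95, 205, 115)  # hypothetical annot.rect values
found = spans_in_rect(sample_page, highlight_rect)
print(found)
```

With a real page, the dictionary would come from page.getText("dict") and the rectangle from the annotation, which is where the rest of this thread goes.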

@joelostblom
Author

Thanks for taking the time to elaborate on your explanation. This makes sense to me. I have been able to extract the highlight rectangles coordinates using PyPDF2 and python-poppler-qt5 (details here), but using those packages I had issues finding the text spans overlapping with those coordinates. Next, I will try using those coordinates in combination with the wiki page on how to extract text from a rectangle with PyMuPDF, and hopefully that should do the trick. Thanks again!

@JorjMcKie
Collaborator

Bah, this is treason ... ;-)

You don't need another package for locating annotations. Why not use PyMuPDF for this?

@JorjMcKie
Collaborator

Scan through a page's annotations like this to find underline / highlight / strikeout / squiggly underline annotations:

annot = page.firstAnnot

while annot:
    if annot.type[0] in (8, 9, 10, 11): # one of the 4 types above
        rect = annot.rect # this is the rectangle the annot covers
        # extract the text within that rect ...
    annot = annot.next # None returned after last annot
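
The bare numbers in that check are easier to follow with names attached. The sketch below is plain Python without a PyMuPDF import. The pairing of codes and names (8 highlight, 9 underline, 10 squiggly, 11 strikeout) is my understanding of the annotation type numbering and should be verified against the installed version. is_text_marker and the sample tuples are hypothetical.

```python
# Assumed code-to-name pairing; verify against your PyMuPDF version.
TEXT_MARKER_TYPES = {
    8: "Highlight",
    9: "Underline",
    10: "Squiggly",
    11: "StrikeOut",
}


def is_text_marker(annot_type):
    """annot_type mimics Annot.type: a sequence whose first item is the type code."""
    return annot_type[0] in TEXT_MARKER_TYPES


# Invented stand-ins for the annot.type values of four annotations.
sample_types = [(0, "Text"), (8, "Highlight"), (2, "FreeText"), (11, "StrikeOut")]
markers = [t for t in sample_types if is_text_marker(t)]
print(markers)
```

In the loop above, the same test would read `if is_text_marker(annot.type):`.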

@joelostblom
Author

Ah that's great thanks! I saw that the rectangle area in the wiki example was constructed by searching for the actual words, which is why I wasn't sure if there was a way to find the highlighted area coordinates/rectangle using PyMuPDF. I will start off from your code, thanks again!

@edxu96

edxu96 commented Jul 11, 2020

Hi, @JorjMcKie. Thanks for the wonderful package. Some enhancement may be required, because many words outside the highlighted area are extracted as well. Here is my code.

import fitz
from itertools import groupby


def print_highlight_text(page, rect):
    """Print the text contained in the given rectangular highlighted area.

    Args:
        page (fitz.Page): the associated page.
        rect (fitz.Rect): rectangular highlighted area.
    """
    words = page.getText("words")  # list of words on page
    words.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    # keep every word whose rectangle touches the highlight rectangle
    mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
    group = groupby(mywords, key=lambda w: w[3])  # one group per text line
    for y1, gwords in group:
        print(" ".join(w[4] for w in gwords))


def main():
    doc = fitz.open('./PDF/sample-3.pdf')
    page = doc[0]
    annot = page.firstAnnot
    print_highlight_text(page, annot.rect)


if __name__ == "__main__":
    main()

The output of this file will be:

Screenshot 2020-07-11 at 17 19 01

demand for biomass is uncertain at that time, and heat demand and electricity prices
vary drastically during the planning period. Furthermore, the optimal operation of
combined heat and power plants has to consider the existing synergies between the
power and heating systems. We propose a solution method using stochastic optimi-

Even when the fully contained words are extracted with the first method in this wiki post, the words before and after the highlight will be selected anyway:

vary drastically during the planning period. Furthermore, the optimal operation of
combined heat and power plants has to consider the existing synergies between the

Any hint as to this problem? My program will parse lots of highlights, so the introduction of that noise will have a huge impact.

@JorjMcKie
Collaborator

JorjMcKie commented Jul 11, 2020

The problem is this:
image
The annot rect is the blue one - not just the highlighted words!
The intersecting words are those surrounded by thin-lined green rectangles.
If you change the code from intersects to fitz.Rect(w[:4]) in annot.rect, then only the two lines fully contained in the blue rect will be delivered.
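
The difference between the two tests can be shown without a PDF. The following is a minimal sketch using plain (x0, y0, x1, y1) tuples instead of fitz.Rect objects. The coordinates are invented to resemble the situation described above, and intersects / contains are hand-written stand-ins, not the library's implementation.

```python
def intersects(a, b):
    """True if rectangles a and b share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(outer, inner):
    """True if inner lies completely within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


annot_rect = (50, 274, 390, 302)  # invented rectangle spanning two text lines

# (x0, y0, x1, y1, word): invented word boxes on four neighbouring lines
words = [
    (60, 262, 100, 277, "period."),    # line above, overlaps the top edge
    (305, 275, 345, 288, "optimal"),   # first highlighted line
    (52, 287, 100, 300, "combined"),   # second highlighted line
    (60, 299, 110, 314, "power"),      # line below, overlaps the bottom edge
]

touching = [w[4] for w in words if intersects(w[:4], annot_rect)]
inside = [w[4] for w in words if contains(annot_rect, w[:4])]
print(touching)
print(inside)
```

Words from the neighbouring lines pass the intersection test because line boxes overlap slightly in the vertical direction, while the containment test drops them.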

@JorjMcKie
Collaborator

JorjMcKie commented Jul 11, 2020

Your next question will probably be:
"How can I shrink the selection down to the highlighted words only?"

That should also be possible:

  • The annotation was created around text which spreads across more than one line.
  • This situation is reflected by a sequence of quadrilaterals - Quad objects in PyMuPDF.
  • The annot rect is the smallest rectangle containing all those quads.
  • One quad is given by 4 point-like objects.
  • You can extract these points via annot.vertices:
>>> len(annot.vertices)  # always a multiple of 4!
8
>>> from pprint import pprint
>>> pprint(annot.vertices)
[(304.84161376953125, 274.20062255859375),
 (388.37060546875, 274.20062255859375),
 (304.84161376953125, 289.2906188964844),
 (388.37060546875, 289.2906188964844),
 (51.02360153198242, 286.20062255859375),
 (182.28359985351562, 286.20062255859375),
 (51.02360153198242, 301.2906188964844),
 (182.28359985351562, 301.2906188964844)]
>>> 

So if you do rect1 = fitz.Quad(annot.vertices[:4]).rect and rect2 = fitz.Quad(annot.vertices[4:]).rect, you get the two part rectangles and can do this (wlist being the page's word list):

>>> for word in wlist:
        if fitz.Rect(word[:4]) in rect1:
            print(word[4])

optimal
operation
of
>>> for word in wlist:
        if fitz.Rect(word[:4]) in rect2:
            print(word[4])

combined
heat
and
power
plants

This should be what you wanted ... 😎
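
To make the grouping of vertices into quads concrete, here is a small sketch in plain Python that uses the eight points printed above. It computes each quad's enclosing rectangle by hand rather than through fitz.Quad(...).rect, and quad_rects is a hypothetical helper name.

```python
vertices = [
    (304.84161376953125, 274.20062255859375),
    (388.37060546875, 274.20062255859375),
    (304.84161376953125, 289.2906188964844),
    (388.37060546875, 289.2906188964844),
    (51.02360153198242, 286.20062255859375),
    (182.28359985351562, 286.20062255859375),
    (51.02360153198242, 301.2906188964844),
    (182.28359985351562, 301.2906188964844),
]


def quad_rects(points):
    """Return one (x0, y0, x1, y1) rectangle per group of 4 points."""
    rects = []
    for i in range(0, len(points), 4):
        quad = points[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


rects = quad_rects(vertices)

# The smallest rectangle around all quads corresponds to the annot rect.
union = (
    min(r[0] for r in rects), min(r[1] for r in rects),
    max(r[2] for r in rects), max(r[3] for r in rects),
)
print(rects)
print(union)
```

The union is much wider than either quad, which is why testing words against it picks up text to the left and right of the actual highlight.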

@edxu96

edxu96 commented Jul 11, 2020

The problem is this:
image
The annot rect is the blue one - not just the highlighted words!
The intersecting words are those surrounded by thin-lined green rectangles.
If you change the code from intersects to fitz.Rect(w[:4]) in annot.rect, then only the two lines fully contained in the blue rect will be delivered.

I tried that one as well. Still want to reduce the extraction to the exact highlighted text. I want to analyse the highlighted text for users, so want as little pollution as possible.

Thanks for another quick solution. I will try it now.

@edxu96

edxu96 commented Jul 11, 2020

Your next question will probably be:
"How can I shrink the selection down to the highlighted words only?"

Thanks for the reply. The method works. I did try Annot.lineEnds, but it is not applicable to highlights.

Will write more code to make the program robust, so that it knows when to include the intersections.

@JorjMcKie
Collaborator

Annot.lineEnds is just a pair of ints encoding the line end symbols for applicable annot types. It has nothing to do with your problem.

Here is a more compact code snippet:

points = annot.vertices
quad_count = len(points) // 4  # 4 points per quad
highlight_words = []
for i in range(quad_count):
    # rectangle enclosing the i-th quad
    r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
    for w in wordlist:  # wordlist = page.getText("words")
        if fitz.Rect(w[:4]) in r:
            highlight_words.append(w[4])

print(" ".join(highlight_words))

Delivers exactly the highlighted text.

@edxu96

edxu96 commented Jul 15, 2020

I enhanced your method by selecting words according to intersection areas. The word is only selected when the highlight contains at least 90% of that word.

import fitz

_threshold_intersection = 0.9  # minimum share of a word's area that must be covered


def _check_contain(r_word, points):
    """Check whether `r_word` is contained in the rectangular area.

    The area of the intersection should be large enough compared to the
    area of the given word.

    Args:
        r_word (fitz.Rect): rectangular area of a single word.
        points (list): list of points in the rectangular area of the
            given part of a highlight.

    Returns:
        bool: whether `r_word` is contained in the rectangular area.
    """
    # `intersect` modifies `r` in place, so build a new `r` on every call.
    r = fitz.Quad(points).rect
    r.intersect(r_word)

    return r.getArea() >= r_word.getArea() * _threshold_intersection


def _extract_annot(annot, words_on_page):
    """Extract words in a given highlight.

    Args:
        annot (fitz.Annot): the highlight annotation.
        words_on_page (list): words of the page, as returned by
            `page.getText("words")`.

    Returns:
        str: words in the entire highlight.
    """
    quad_points = annot.vertices
    quad_count = len(quad_points) // 4  # 4 points per quad
    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        points = quad_points[i * 4: i * 4 + 4]
        words = [
            w for w in words_on_page if
            _check_contain(fitz.Rect(w[:4]), points)
        ]
        sentences[i] = ' '.join(w[4] for w in words)
    sentence = ' '.join(sentences)

    return sentence
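
How the threshold behaves is easiest to see with numbers. The sketch below recomputes the same "share of the word covered by the highlight" ratio with plain tuples. The rectangles are invented, and area / overlap_ratio are hypothetical helpers rather than PyMuPDF calls.

```python
def area(r):
    """Area of an (x0, y0, x1, y1) rectangle; 0 if it is empty or inverted."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_ratio(word, highlight):
    """Share of the word's area that is covered by the highlight."""
    inter = (
        max(word[0], highlight[0]), max(word[1], highlight[1]),
        min(word[2], highlight[2]), min(word[3], highlight[3]),
    )
    word_area = area(word)
    return area(inter) / word_area if word_area else 0.0


highlight = (100.0, 200.0, 300.0, 212.0)  # invented quad rectangle

# Invented word boxes: fully inside, half inside, fully outside.
boxes = {
    "inside": (110.0, 200.0, 150.0, 212.0),
    "edge": (290.0, 200.0, 310.0, 212.0),
    "outside": (320.0, 200.0, 360.0, 212.0),
}

ratios = {name: overlap_ratio(box, highlight) for name, box in boxes.items()}
strict = [name for name, ratio in ratios.items() if ratio >= 0.9]
loose = [name for name, ratio in ratios.items() if ratio >= 0.1]
print(ratios)
print(strict, loose)
```

A high threshold keeps only words that the highlight covers almost entirely, and a low one also admits words that are only partly covered. Which value fits depends on how tightly the PDF viewer drew the highlight around the text.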

@JorjMcKie
Collaborator

Excellent!
Please also bear in mind that the text of all text marker annotations (i.e. also underlines, strikethroughs, etc.) can be extracted this way - even when the text is not spread across more than one line: the annot.vertices are always there.

@edxu96

edxu96 commented Jul 15, 2020

Yes, you always foresee the next problem 😅. I just got out of some errors in another part of my package that were caused by underline annotations. I fixed them by checking the type of each annotation before passing it to these two functions.

@prashantkg96

prashantkg96 commented Jan 8, 2021

Hey @edxu96 @JorjMcKie, this thread was really helpful for one of my ongoing projects. It worked well for extracting text from a paragraph, but it seems to fail when I run the same on a table. I used the enhanced method by @edxu96 and called the function _extract_annot for each annotation to generate a list. Unfortunately, the list is populated with empty entries. The size of the list is correct and matches the number of highlights in the doc, but the content of each item is blank.
Below is the screenshot
image
image

Could you please suggest what's wrong here?

Here is the piece of script which calls the above function:

    doc = fitz.open(currentpdfPath)
    page = doc[6]
    words = page.getText("words")
    pageOutlist = []
    for thisAnnot in page.annots():  # iterate over the page's annotations
        print(thisAnnot)
        finalout = _extract_annot(thisAnnot, words)
        pageOutlist.append(finalout)

    print(pageOutlist)

Here is the link for the document I used. I have randomly highlighted a few rows and their values

EDIT: turns out setting _threshold_intersection = 0.1 makes it work.

@edxu96

edxu96 commented Jan 8, 2021

@shadowwarrior29 Glad to hear that it works. Do you still need help?

By the way, I couldn't find any highlight in your document. Probably because I am using a Mac.

@prashantkg96

@edxu96 the script is working as expected so no help needed at the moment :)

Sorry, I should have mentioned that the document in the link is not highlighted; I manually highlighted it after downloading for testing purposes. Here is the highlighted doc I've been using to test the script for different highlighted-text scenarios:
appl.pdf

@dummifiedme

@edxu96 I keep getting this error.

image

@JorjMcKie
Collaborator

@dummifiedme - unfortunately the function _check_contain cannot be inspected, but I guess the sequence of points and the rectangle must be reversed.
The error message says that Rect creation did not happen with 4 floats.

@dummifiedme

dummifiedme commented Jan 12, 2021

The _check_contain function is from the code further up there in the thread.

EDIT: Found the error, it was on my part. I didn't pass the "words" argument in the getText() function
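
A cheap guard against this mistake is to check the shape of the word list before using it. The sketch below is plain Python. It assumes that each entry of the "words" output starts with four numeric coordinates followed by the word string, and that calling getText() without an argument yields one plain string. looks_like_word_list is a hypothetical helper.

```python
def looks_like_word_list(words):
    """True if words is a list of (x0, y0, x1, y1, text, ...) entries."""
    if not isinstance(words, list):
        return False
    for w in words:
        if not isinstance(w, (tuple, list)) or len(w) < 5:
            return False
        if not all(isinstance(v, (int, float)) for v in w[:4]):
            return False
        if not isinstance(w[4], str):
            return False
    return True


# Invented stand-ins for the two kinds of return value.
as_plain_text = "combined heat and power plants"
as_words = [(51.0, 286.2, 100.0, 301.3, "combined", 0, 1, 0)]

print(looks_like_word_list(as_plain_text))
print(looks_like_word_list(as_words))
```

Calling such a check at the top of _extract_annot would turn the confusing Rect error into an explicit complaint about the input.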

@Saumya-09

Saumya-09 commented Oct 22, 2021

Can you paste the entire code? I am trying to work on a similar problem and it would be helpful.
This is the trial file I am working on:
trial.pdf

import fitz
from itertools import groupby
from pprint import pprint

_threshold_intersection = 0.1  # if the intersection is large enough.


def _check_contain(r_word, points):
    """If `r_word` is contained in the rectangular area.

    The area of the intersection should be large enough compared to the
    area of the given word.

    Args:
        r_word (fitz.Rect): rectangular area of a single word.
        points (list): list of points in the rectangular area of the
            given part of a highlight.

    Returns:
        bool: whether `r_word` is contained in the rectangular area.
    """
    # `r` is mutable, so every time a new `r` should be initiated.
    r = fitz.Quad(points).rect
    r.intersect(r_word)

    if r.get_area() >= r_word.get_area() * _threshold_intersection:
        contain = True
    else:
        contain = False
    return contain


def _extract_annot(annot, words_on_page):
    """Extract words in a given highlight.

    Args:
        annot (fitz.Annot): [description]
        words_on_page (list): [description]

    Returns:
        str: words in the entire highlight.
    """
    quad_points = annot.vertices
    pprint(len(annot.vertices))
    print("quad_points: ", quad_points)

    quad_count = int(len(quad_points) / 4)
    print("quad_count: ", quad_count)

    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        points = quad_points[i * 4: i * 4 + 4]
        print(points)
        words = [
            w for w in words_on_page if
            _check_contain(fitz.Rect(w[:4]), points)
        ]
    sentence = ' '.join(sentences)
    print("The words are: ", sentence)
    # return sentence


def main():
    doc = fitz.open('trial.pdf')
    page = doc[0]
    wordlist = page.get_text("words")
    annot = page.firstAnnot
    # print(annot)
    if annot == None:
        print("Page contains no highlighted text")
    else:
        _extract_annot(annot, wordlist)


if __name__ == "__main__":
    main()

@JorjMcKie @edxu96
Is there something wrong with the code?
I am not able to extract the highlighted text
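
Nobody in the thread confirmed a cause, but one thing stands out when reading the code as posted: compared with the earlier version of _extract_annot, the line `sentences[i] = ' '.join(w[4] for w in words)` is missing, so the words found for each quad are never stored. The plain-Python sketch below reproduces only that effect. join_per_quad, its store_result flag and the word lists are made up for the demonstration.

```python
def join_per_quad(words_per_quad, store_result):
    """Mimic the sentence-building loop, with or without the assignment."""
    sentences = ['' for _ in words_per_quad]
    for i, words in enumerate(words_per_quad):
        if store_result:
            # the assignment that the earlier version of the function has
            sentences[i] = ' '.join(words)
    return ' '.join(sentences)


# Invented result of the per-quad word search.
found = [["optimal", "operation", "of"],
         ["combined", "heat", "and", "power", "plants"]]

without_assignment = join_per_quad(found, store_result=False)
with_assignment = join_per_quad(found, store_result=True)
print(repr(without_assignment))
print(repr(with_assignment))
```

Without the assignment, only the separators between empty strings are printed, which matches the symptom of getting no highlighted text. If that is indeed the cause, restoring the assignment and the commented-out return should be enough.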

@edxu96

edxu96 commented Oct 25, 2021

Can you paste the entire code. I am trying to work on a similar problem and it would be helpful

@Saumya-09 I haven't checked your code and haven't worked on PDF for a while, but this is where I used those functions in zequnyu/cmdict.
