Is it possible to extract highlighted text? #318

joelostblom · 2019-07-04T20:33:57Z

I noticed this helpful wiki post on how to select text underlying a rectangle. Is it also possible to select highlighted text that is not in the shape of a rectangle? Preferably together with any comments/annotations made on those text regions.

JorjMcKie · 2019-07-05T11:03:58Z

You can extract the text (and images) from pages via page.getText("dict"). This also works for non-PDF documents.
The result is a dictionary explained here. Except for text colors, this dictionary could be used to reconstruct a full document page in its original look, including images.
It would be your task to relate any annotations or links to those data: they are not contained in that dict.

joelostblom · 2019-07-05T16:00:14Z

Thanks for the quick reply @JorjMcKie !

Just to make sure I understand correctly: when you say "It would be your task to relate any annotations or links to those data: they are not contained in that dict.", do you mean that I will need to manually search for any text snippet I want to extract? Or is there a function within PyMuPDF which can extract the coordinates of the highlighted areas/texts, so that I can then use these coordinates to extract the relevant snippets from the dictionary containing all the extracted text from a page?

JorjMcKie · 2019-07-05T16:27:45Z

Not quite.

The PDF concept of annotations represents a way to add "comments" or remarks with a reduced permission level. They don't count as "edits".
The "normal" page text is not affected by annots - annots are like dust which can be wiped off again, if you follow this metaphor.
Some annot types support highlighting and underlining text.

The basic PDF page's text itself can only have properties inherited from the font in use, like italic, bold, monospaced, serifed.

It cannot directly be highlighted or underlined. Any such effects come from outside the actual text specification (for which annots are just an example).
So it is your task to e.g. take the rectangle from a highlighting annotation, then dig your way through the page extraction dictionary and look for text pieces (so-called "spans") which are located within that rectangle.
This is what I was alluding to in the previous post.
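As a rough illustration of that "digging", here is a minimal sketch in plain Python. It assumes the page dictionary has the blocks / lines / spans nesting with a "bbox" and "text" per span; the sample dictionary, the rectangle, and the helper names `contains` and `spans_in_rect` are made up for this example and are not part of PyMuPDF.

```python
def contains(outer, inner):
    """True if rectangle `inner` (x0, y0, x1, y1) lies fully inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def spans_in_rect(page_dict, rect):
    """Collect the text of every span whose bbox lies inside `rect`."""
    found = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" entry and are skipped.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if contains(rect, span["bbox"]):
                    found.append(span["text"])
    return found


# Hand-made stand-in for the page dictionary; real output carries more keys.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"bbox": (50, 100, 170, 112), "text": "first line"}]},
            {"spans": [{"bbox": (50, 114, 170, 126), "text": "second line"}]},
        ]},
        {"type": 1, "bbox": (50, 200, 150, 300)},  # image block
    ]
}

highlight_rect = (40, 98, 200, 113)  # made-up annotation rectangle
print(spans_in_rect(sample_page, highlight_rect))
```

A span can cover a long run of text in one font, so a span-level match may be coarser than the highlight itself.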

joelostblom · 2019-07-05T18:12:36Z

Thanks for taking the time to elaborate on your explanation. This makes sense to me. I have been able to extract the highlight rectangles' coordinates using PyPDF2 and python-poppler-qt5 (details here), but using those packages I had issues finding the text spans overlapping with those coordinates. Next, I will try using those coordinates in combination with the wiki page on how to extract text from a rectangle with PyMuPDF, and hopefully that should do the trick. Thanks again!

JorjMcKie · 2019-07-05T18:28:39Z

Bah, this is treason ... ;-)

You don't need another package for locating annotations. Why not use PyMuPDF for this?

JorjMcKie · 2019-07-05T18:40:02Z

Scan through a page's annotations like this to find highlight / underline / squiggly underline / strikeout annotations:

annot = page.firstAnnot

while annot:
    if annot.type[0] in (8, 9, 10, 11): # one of the 4 types above
        rect = annot.rect # this is the rectangle the annot covers
        # extract the text within that rect ...
    annot = annot.next # None returned after last annot
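The same walk can be wrapped in a small helper. This is only a sketch: the stand-in objects below mimic the three attributes the loop relies on (`type`, `rect`, `next`) so it runs without a PDF, and `text_marker_rects` is a hypothetical name, not a PyMuPDF function.

```python
from types import SimpleNamespace

# Type numbers of the four text marker annotations checked in the loop above.
TEXT_MARKER_TYPES = {8: "Highlight", 9: "Underline", 10: "Squiggly", 11: "StrikeOut"}


def text_marker_rects(first_annot):
    """Walk an annotation chain via `.next` and collect text marker rects."""
    rects = []
    annot = first_annot
    while annot:
        if annot.type[0] in TEXT_MARKER_TYPES:
            rects.append(annot.rect)
        annot = annot.next  # None after the last annotation
    return rects


# Stand-in objects for illustration only (not real annotations).
third = SimpleNamespace(type=(9, "Underline"), rect=(10, 50, 90, 60), next=None)
second = SimpleNamespace(type=(2, "FreeText"), rect=(0, 0, 5, 5), next=third)
first = SimpleNamespace(type=(8, "Highlight"), rect=(10, 10, 90, 20), next=second)

rects = text_marker_rects(first)
print(rects)
```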

joelostblom · 2019-07-05T18:53:38Z

Ah that's great thanks! I saw that the rectangle area in the wiki example was constructed by searching for the actual words, which is why I wasn't sure if there was a way to find the highlighted area coordinates/rectangle using PyMuPDF. I will start off from your code, thanks again!

edxu96 · 2020-07-11T15:22:11Z

Hi, @JorjMcKie. Thanks for the wonderful package. Some enhancement may be required, because many words outside the highlighted area are extracted as well. Here is my code.

import fitz
from itertools import groupby


def print_highlight_text(page, rect):
    """Print text contained in the given rectangular highlighted area.

    Args:
        page (fitz.Page): the associated page.
        rect (fitz.Rect): rectangular highlighted area.
    """
    words = page.getText("words")  # list of words on page
    words.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    # keep every word whose rectangle touches the highlight rectangle
    mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
    # words sharing the same bottom coordinate are treated as one line
    group = groupby(mywords, key=lambda w: w[3])
    for y1, gwords in group:
        print(" ".join(w[4] for w in gwords))


def main():
    doc = fitz.open('./PDF/sample-3.pdf')
    page = doc[0]
    annot = page.firstAnnot
    print_highlight_text(page, annot.rect)


if __name__ == "__main__":
    main()

The output of this file will be:

demand for biomass is uncertain at that time, and heat demand and electricity prices
vary drastically during the planning period. Furthermore, the optimal operation of
combined heat and power plants has to consider the existing synergies between the
power and heating systems. We propose a solution method using stochastic optimi-

Even when the fully contained words are extracted with the first method in this wiki post, the words before and after the highlight will be selected anyway:

vary drastically during the planning period. Furthermore, the optimal operation of
combined heat and power plants has to consider the existing synergies between the

Any hint as to this problem? My program will parse lots of highlights, so the introduction of that noise will have a huge impact.

JorjMcKie · 2020-07-11T17:05:12Z

The problem is this:

The annot rect is the blue one - not just the highlighted words!
The intersecting words are those surrounded with thin-lined green rectangles.
If you change the code from intersects to fitz.Rect(w[:4]) in annot.rect, then only the two lines fully contained in the blue rect will be delivered.
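To make the difference between the two tests concrete, here is a small sketch with plain tuples instead of fitz.Rect objects. The coordinates are invented, and `intersects` / `contains` are stand-in helpers written for this example.

```python
def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(outer, inner):
    """True if rectangle `inner` lies fully inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


annot_rect = (50, 270, 390, 305)  # made-up bounding box of a two-line highlight
words = [
    (60, 262, 100, 275, "demand"),    # line above, only grazes the top edge
    (60, 275, 100, 288, "vary"),      # fully inside
    (60, 288, 110, 301, "combined"),  # fully inside
    (60, 301, 100, 314, "power"),     # line below, grazes the bottom edge
]

touching = [w[4] for w in words if intersects(annot_rect, w[:4])]
inside = [w[4] for w in words if contains(annot_rect, w[:4])]
print(touching)
print(inside)
```

With the intersection test the lines above and below leak in; with containment only the two lines inside the rectangle remain.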

JorjMcKie · 2020-07-11T17:23:03Z

Your next question will probably be:
"How can I shrink the selection down to the highlighted words only?"

That should also be possible:

The annotation was created surrounding text which spreads across more than 1 line.
This situation is reflected by a sequence of quadrilaterals - Quad objects in PyMuPDF.
The annot rect is the smallest rectangle containing all those quads.
Each quad is given by 4 point-like objects.
You can extract these points via annot.vertices:

>>> len(annot.vertices)  # always a multiple of 4!
8
>>> from pprint import pprint
>>> pprint(annot.vertices)
[(304.84161376953125, 274.20062255859375),
 (388.37060546875, 274.20062255859375),
 (304.84161376953125, 289.2906188964844),
 (388.37060546875, 289.2906188964844),
 (51.02360153198242, 286.20062255859375),
 (182.28359985351562, 286.20062255859375),
 (51.02360153198242, 301.2906188964844),
 (182.28359985351562, 301.2906188964844)]
>>>

So if you do rect1 = fitz.Quad(annot.vertices[:4]).rect and rect2 = fitz.Quad(annot.vertices[4:]).rect, you get the two part rectangles and can do this:

>>> for word in wlist:
...     if fitz.Rect(word[:4]) in rect1:
...         print(word[4])
...
optimal
operation
of
>>> for word in wlist:
...     if fitz.Rect(word[:4]) in rect2:
...         print(word[4])
...
combined
heat
and
power
plants
>>>

This should be what you wanted ... 😎
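For reference, the split into per-quad rectangles can be sketched without PyMuPDF, using the vertex list printed above. `quad_rects` and `union` are hypothetical helpers, and taking min/max of the four points is an assumption that matches what a quad's bounding rectangle would be.

```python
def quad_rects(vertices):
    """Split a flat vertex list into groups of 4 and return each bounding box."""
    rects = []
    for i in range(0, len(vertices), 4):
        xs = [p[0] for p in vertices[i:i + 4]]
        ys = [p[1] for p in vertices[i:i + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def union(rects):
    """Smallest rectangle containing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


# The eight points shown in the session above.
vertices = [
    (304.84161376953125, 274.20062255859375),
    (388.37060546875, 274.20062255859375),
    (304.84161376953125, 289.2906188964844),
    (388.37060546875, 289.2906188964844),
    (51.02360153198242, 286.20062255859375),
    (182.28359985351562, 286.20062255859375),
    (51.02360153198242, 301.2906188964844),
    (182.28359985351562, 301.2906188964844),
]

rect1, rect2 = quad_rects(vertices)
print(rect1)
print(rect2)
print(union([rect1, rect2]))
```

The union spans both lines at full width, which is why testing words against the overall rectangle picks up text that was never highlighted.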

edxu96 · 2020-07-11T17:24:25Z

The problem is this:

The annot rect is the blue one - not just the highlighted words!
The intersecting words are those surrounded with thin-lined green rectangles.
If you change the code from intersects to fitz.Rect(w[:4]) in annot.rect, then only the two lines fully contained in the blue rect will be delivered.

I tried that one as well. Still want to reduce the extraction to the exact highlighted text. I want to analyse the highlighted text for users, so want as little pollution as possible.

Thanks for another quick solution. I will try it now.

edxu96 · 2020-07-11T17:40:14Z

Your next question will probably be:
"How can I shrink the selection down to the highlighted words only?"

Thanks for the reply. The method works. I did try Annot.lineEnds, but it is not applicable to highlights.

Will write more code to make the program robust, so that it knows when to include the intersections.

JorjMcKie · 2020-07-11T17:49:09Z

Annot.lineEnds is just a pair of ints encoding the line end symbols for applicable annot types. Has nothing to do with your problem.

Here is a more compact code snippet:

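wordlist = page.getText("words")  # words on the annotation's page
points = annot.vertices
quad_count = len(points) // 4  # 4 points per quad
highlight_words = []
for i in range(quad_count):
    r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect  # bounding rect of quad i
    for w in wordlist:
        if fitz.Rect(w[:4]) in r:  # word lies fully inside this quad
            highlight_words.append(w[4])

print(" ".join(highlight_words))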

Delivers exactly the highlighted text.

edxu96 · 2020-07-15T13:53:24Z

I enhanced your method by selecting words according to intersection areas. The word is only selected when the highlight contains at least 90% of that word.

_threshold_intersection = 0.9  # minimum share of a word's area the highlight must cover.


def _check_contain(r_word, points):
    """If `r_word` is contained in the rectangular area.

    The area of the intersection should be large enough compared to the
    area of the given word.

    Args:
        r_word (fitz.Rect): rectangular area of a single word.
        points (list): list of points in the rectangular area of the
            given part of a highlight.

    Returns:
        bool: whether `r_word` is contained in the rectangular area.
    """
    # `intersect` changes the rectangle in place, so build a fresh `r` on every call.
    r = fitz.Quad(points).rect
    r.intersect(r_word)

    return r.getArea() >= r_word.getArea() * _threshold_intersection


def _extract_annot(annot, words_on_page):
    """Extract words in a given highlight.

    Args:
        annot (fitz.Annot): a text marker annotation (e.g. a highlight).
        words_on_page (list): word entries from page.getText("words").

    Returns:
        str: words in the entire highlight.
    """
    quad_points = annot.vertices
    quad_count = len(quad_points) // 4
    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        points = quad_points[i * 4: i * 4 + 4]
        words = [
            w for w in words_on_page if
            _check_contain(fitz.Rect(w[:4]), points)
        ]
        sentences[i] = ' '.join(w[4] for w in words)
    sentence = ' '.join(sentences)

    return sentence
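The area test itself can be tried out in isolation. The sketch below uses plain tuples and invented coordinates; `overlap_ratio` and `select_words` are illustrative names, not taken from the package.

```python
def overlap_ratio(word, quad_rect):
    """Fraction of the word box's area that lies inside quad_rect."""
    x0 = max(word[0], quad_rect[0])
    y0 = max(word[1], quad_rect[1])
    x1 = min(word[2], quad_rect[2])
    y1 = min(word[3], quad_rect[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # no overlap (this also covers zero-area word boxes)
    word_area = (word[2] - word[0]) * (word[3] - word[1])
    return (x1 - x0) * (y1 - y0) / word_area


def select_words(words, quad_rect, threshold=0.9):
    """Keep the words covered by quad_rect to at least `threshold`."""
    return [w[4] for w in words if overlap_ratio(w[:4], quad_rect) >= threshold]


quad = (100, 100, 200, 112)  # made-up rectangle of one highlight quad
words = [
    (110, 100, 150, 112, "inside"),   # fully covered
    (152, 99.5, 192, 112, "nearly"),  # sticks out slightly at the top
    (195, 100, 215, 112, "edge"),     # only a quarter is covered
]

print(select_words(words, quad))
print(select_words(words, quad, threshold=0.1))
```

Strict containment would drop "nearly" because its box pokes out of the quad by half a point; the ratio test tolerates that, and a lower threshold admits partially covered words too.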

JorjMcKie · 2020-07-15T14:34:03Z

Excellent!
Please also bear in mind that all text of all text marker annotations (i.e. also underlines, strikethrough, etc.) can be extracted this way - even when text is not spread across more than one line: the annot.vertices are always there.

edxu96 · 2020-07-15T20:10:53Z

Yes, you always foresee the next problem 😅. I just ran into some errors in another part of my package because of underline annotations. I fixed it by checking the type of each annotation before passing it to these two functions.

prashantkg96 · 2021-01-08T08:07:28Z

Hey @edxu96 @JorjMcKie, this thread was really helpful for one of my ongoing projects. It works well for extracting text from a paragraph, but it seems to fail when I run the same code on a table. I used the enhanced method by @edxu96 and called the function _extract_annot for each annotation to generate a list. Unfortunately, the list is populated with empty entries. The size of the list is correct and matches the number of highlights in the doc, but the content of each item is blank.
Below is the screenshot

Could you please suggest what's wrong here?

Here is the piece of script which calls the above function:

    doc = fitz.open(currentpdfPath)
    page = doc[6]
    words = page.getText("words")
    pageOutlist = []
    for thisAnnot in page.annots():  # iterate over the page's annotations directly
        print(thisAnnot)
        finalout = _extract_annot(thisAnnot, words)
        pageOutlist.append(finalout)

    print(pageOutlist)

Here is the link for the document I used. I have randomly highlighted a few rows and their values

EDIT: turns out setting _threshold_intersection = 0.1 makes it work.

edxu96 · 2021-01-08T09:54:58Z

@shadowwarrior29 Glad to hear that it works. Do you still need help?

By the way, I couldn't find any highlight in your document. Probably because I am using a Mac.

prashantkg96 · 2021-01-08T11:19:56Z

@edxu96 the script is working as expected so no help needed at the moment :)

Sorry, I should have mentioned that the document in the link is not highlighted; I manually highlighted it after downloading for testing purposes. Here is the highlighted doc I've been using to test the script for different highlighted text scenarios:
appl.pdf

dummifiedme · 2021-01-11T19:02:00Z

@edxu96 I keep getting this error.

JorjMcKie · 2021-01-11T23:32:30Z

@dummifiedme - unfortunately the function _check_contain cannot be inspected, but I guess the sequence of points and the rectangle must be reversed.
The error message says, that Rect creation did not happen with 4 floats.

dummifiedme · 2021-01-12T03:43:22Z

The _check_contain function is from the code further up there in the thread.

EDIT: Found the error, it was on my part. I didn't pass the "words" argument to the getText() call.
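This mix-up is easy to make because getText() without an argument returns the page's plain text as one string, while the functions above expect the list of (x0, y0, x1, y1, "word", ...) entries. A small guard like the following sketch (the name `looks_like_word_list` is made up for this example) can catch it before any rectangle is built:

```python
def looks_like_word_list(words):
    """True if `words` has the shape of a word list rather than plain text."""
    if isinstance(words, str):
        return False  # plain text was passed instead of word entries
    return all(
        isinstance(w, (tuple, list))
        and len(w) >= 5
        and all(isinstance(v, (int, float)) for v in w[:4])
        for w in words
    )


good = [(51.0, 286.2, 95.3, 301.3, "combined", 0, 0, 0)]  # invented entry
bad = "combined heat and power plants"

print(looks_like_word_list(good))
print(looks_like_word_list(bad))
```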

Saumya-09 · 2021-10-22T11:27:31Z

Can you paste the entire code? I am trying to work on a similar problem and it would be helpful.
This is the trial file I am working on:
trial.pdf

import fitz
from itertools import groupby
from pprint import pprint

_threshold_intersection = 0.1  # if the intersection is large enough.


def _check_contain(r_word, points):
    """If `r_word` is contained in the rectangular area.

    The area of the intersection should be large enough compared to the
    area of the given word.

    Args:
        r_word (fitz.Rect): rectangular area of a single word.
        points (list): list of points in the rectangular area of the
            given part of a highlight.

    Returns:
        bool: whether `r_word` is contained in the rectangular area.
    """
    # `r` is mutable, so every time a new `r` should be initiated.
    r = fitz.Quad(points).rect
    r.intersect(r_word)

    if r.get_area() >= r_word.get_area() * _threshold_intersection:
        contain = True
    else:
        contain = False
    return contain


def _extract_annot(annot, words_on_page):
    """Extract words in a given highlight.

    Args:
        annot (fitz.Annot): [description]
        words_on_page (list): [description]

    Returns:
        str: words in the entire highlight.
    """
    quad_points = annot.vertices
    pprint(len(annot.vertices))
    print("quad_points: ", quad_points)

    quad_count = int(len(quad_points) / 4)
    print("quad_count: ", quad_count)

    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        points = quad_points[i * 4: i * 4 + 4]
        print(points)
        words = [
            w for w in words_on_page if
            _check_contain(fitz.Rect(w[:4]), points)
        ]
    sentence = ' '.join(sentences)
    print("The words are: ", sentence)
    # return sentence


def main():
    doc = fitz.open('trial.pdf')
    page = doc[0]
    wordlist = page.get_text("words")
    annot = page.firstAnnot
    # print(annot)
    if annot is None:
        print("Page contains no highlighted text")
    else:
        _extract_annot(annot, wordlist)


if __name__ == "__main__":
    main()

@JorjMcKie @edxu96
Is there something wrong with the code?
I am not able to extract the highlighted text
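One difference from the version posted earlier in this thread stands out: the loop builds `words` for each quad but never stores anything in `sentences[i]`, so joining `sentences` can only give blanks. Below is a minimal sketch of the loop with that assignment restored. It uses plain tuples and a simple containment test in place of the PyMuPDF calls, and the coordinates and the name `extract_from_quads` are invented for illustration.

```python
def contains(outer, inner):
    """True if rectangle `inner` (x0, y0, x1, y1) lies fully inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def extract_from_quads(quad_rects, words_on_page):
    """Join the words found in each quad rectangle into one string."""
    sentences = ['' for _ in quad_rects]
    for i, rect in enumerate(quad_rects):
        words = [w for w in words_on_page if contains(rect, w[:4])]
        sentences[i] = ' '.join(w[4] for w in words)  # the assignment missing above
    return ' '.join(sentences)


# Invented quad rectangles and word entries, in reading order.
quads = [(300, 274, 390, 290), (50, 286, 185, 302)]
words_on_page = [
    (305, 276, 340, 288, "optimal"),
    (342, 276, 385, 288, "operation"),
    (52, 288, 100, 300, "combined"),
    (102, 288, 125, 300, "heat"),
    (200, 288, 240, 300, "synergies"),  # outside both quads
]

print(extract_from_quads(quads, words_on_page))
```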

edxu96 · 2021-10-25T18:44:12Z

Can you paste the entire code. I am trying to work on a similar problem and it would be helpful

@Saumya-09 I haven't checked your code and haven't worked on PDF for a while, but this is where I used those functions in zequnyu/cmdict.

JorjMcKie added the example required label Jul 5, 2019

JorjMcKie closed this as completed Jul 9, 2019

edxu96 mentioned this issue Jul 11, 2020

Parsing text files, return searching results pastydev/cmdict#12

Closed

edxu96 mentioned this issue Jul 13, 2020

Extract text from PDF files pastydev/cmdict#31

Closed

Don-Yin mentioned this issue Jun 2, 2022

page.get_text(clip=rect) returns empty string in some cases #1741

Closed
