I am unable to extract any type of text from my document #3858

LSUCDS · 2024-09-11T08:09:35Z

I have the following code:

import fitz  # PyMuPDF

doc = fitz.open('path here')
print(doc.load_page(0).get_textpage().extractDICT())

which returns:

{'width': 841.9199829101562, 'height': 595.2000122070312, 'blocks': []}

I have also tried extractTEXT(), extractWORDS(), and extractRAWDICT() instead of extractDICT(), but these return an empty string, an empty list, and (for the raw dict version) the same empty result as above, respectively. There is only one page in the document. Could it be that the entire page is considered an image for some reason? That's what I thought at least, but running print(doc.load_page(0).get_images()) returns:

[(4, 0, 3508, 2480, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode')]

I'm not entirely sure what to do with this.
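
For readers puzzling over the same output, here is a minimal sketch that decodes the tuple field by field. The field names follow the order PyMuPDF documents for get_images() entries; the resolution estimate assumes the image fills the whole page, which the output alone does not prove.

```python
# Field order of a Page.get_images() entry (with the default full=False).
FIELDS = ("xref", "smask", "width", "height", "bpc",
          "colorspace", "alt_colorspace", "name", "filter")

entry = (4, 0, 3508, 2480, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode')
image = dict(zip(FIELDS, entry))

# Page size in points (1 pt = 1/72 inch), taken from the extractDICT() output.
page_width_pt, page_height_pt = 841.9199829101562, 595.2000122070312

# ASSUMPTION: the image covers the full page, so pixels / inches = resolution.
dpi_x = image["width"] / (page_width_pt / 72)
dpi_y = image["height"] / (page_height_pt / 72)

print(image["filter"], round(dpi_x), round(dpi_y))  # DCTDecode 300 300
```

A single JPEG-encoded (DCTDecode) image whose pixel size matches the page at about 300 dpi is what a scanned page typically looks like.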

Note: The document is neither encrypted nor locked with a password. I am unable to share the document itself, but the content I want to extract doesn't seem to be actual text, as I am unable to highlight it with my cursor. So is it an image after all, and if so, does PyMuPDF provide any form of OCR to extract the text from a specific region?

And if this is the case, what is the most efficient method of detecting whether the PDF sent to me is text or an image in order to create a robust load() method that can handle all types of PDFs? I don't know if checking the contents (and length) of extractDICT() would be the most robust method given that PDFs that are part text and part image exist. I'd have to check if it contains the specific text that I'm looking for in the dict returned by extractDICT() and if it's not present then assume that the entire PDF is an image.

UPDATE: I'm now 100% sure it's an image. I am trying to use get_textpage_ocr() but it does not exist for a Page object even though the documentation says so. Maybe I have misunderstood something?

Answered by JorjMcKie

JorjMcKie · 2024-09-11T08:58:58Z

JorjMcKie
Sep 11, 2024
Maintainer

If page.get_text() returns no (or only white) text, then all you have is a number of heuristics to determine the situation:

The page may be covered completely by an image. If this is true, then OCR is an option to check whether text is hiding inside that image.
The page may contain so-called annotations only and no other content. To confirm, check whether page.first_annot is not None.
The page may contain vector graphics that "simulate" (i.e. look like) text. This is the hardest case - but doesn't happen very often.
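
The three heuristics above could be combined roughly as follows. This is only a sketch, not an official PyMuPDF recipe: the function names, the category labels and the 0.9 coverage threshold are hypothetical choices made for illustration. The page calls used (get_text(), rect, get_image_info(), first_annot, get_drawings()) are existing PyMuPDF Page members.

```python
def covered_fraction(page_rect, bboxes):
    """Largest share of the page area covered by any single rectangle."""
    x0, y0, x1, y1 = page_rect
    page_area = max((x1 - x0) * (y1 - y0), 1e-9)
    best = 0.0
    for bx0, by0, bx1, by1 in bboxes:
        # Clip the image rectangle to the page before measuring it.
        w = min(bx1, x1) - max(bx0, x0)
        h = min(by1, y1) - max(by0, y0)
        if w > 0 and h > 0:
            best = max(best, (w * h) / page_area)
    return best


def classify(text, page_rect, image_bboxes, has_annots, drawing_count,
             min_cover=0.9):  # min_cover is an arbitrary, hypothetical threshold
    """Apply the heuristics to plain values (no PyMuPDF objects needed)."""
    if text.strip():
        return "text"
    if covered_fraction(page_rect, image_bboxes) >= min_cover:
        return "image"        # candidate for OCR
    if drawing_count > 0:
        return "vector"       # graphics that may only look like text
    if has_annots:
        return "annotations"  # annotations and nothing else
    return "empty"


def classify_page(page):
    """Collect the inputs from a PyMuPDF page object (duck-typed)."""
    return classify(
        page.get_text(),
        tuple(page.rect),
        [tuple(info["bbox"]) for info in page.get_image_info()],
        page.first_annot is not None,
        len(page.get_drawings()),
    )
```

Note that a page mixing real text with a scanned region is reported as "text" here; handling that case would need a comparison of the text and image rectangles, which this sketch does not attempt.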

Just saw that you determined an image is covering your page. To OCR a page do this:

page = doc[pno]  # load page with 0-based number pno
tp = page.get_textpage_ocr(full=True, dpi=150) # execute OCR, store results in the textpage

# now start extracting text, but *ALWAYS* refer to the textpage!!!
text = page.get_text(textpage=tp)  # for example
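
For a loader that must cope with both kinds of PDF, one possible arrangement is to try normal extraction first and run OCR only when it finds nothing. The helper below is a hypothetical sketch built from the two calls shown above; its name and the (text, used_ocr) return shape are assumptions, and it needs a working Tesseract installation to take the OCR branch.

```python
def extract_text(page, dpi=150):
    """Return (text, used_ocr); OCR runs only if plain extraction is empty."""
    text = page.get_text()
    if text.strip():
        return text, False
    # Nothing extractable: OCR the full page and read from that textpage.
    tp = page.get_textpage_ocr(full=True, dpi=dpi)
    return page.get_text(textpage=tp), True
```

Skipping OCR for pages that already yield text keeps the slow path for the pages that need it.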

4 replies

LSUCDS Sep 11, 2024
Author

I know for a fact now that it's an image. Despite IntelliSense not picking up the method get_textpage_ocr(), the program would still run, so it is there. I just need to install Tesseract before I can test it out. After doing so (and if it works), I'll mark your reply as the answer.

What about detecting whether a page is already an image or actually contains text? Is the best approach perhaps to simply check whether it is an image and, if not, convert it to an image and then use OCR?

LSUCDS Sep 11, 2024
Author

It's PyTesseract I have to install, right? And then:

from os import environ
environ["TESSDATA_PREFIX"] = "path/to/pytesseract/tessdata"

JorjMcKie Sep 11, 2024
Maintainer

You do not need the Python package pytesseract.
Install Tesseract directly - all this is extensively explained in the installation section of the documentation.
Setting os.environ["TESSDATA_PREFIX"] will not work! Either supply the tessdata parameter in each OCR-related call, or set the environment variable outside your script.
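
The first option, passing tessdata with each OCR call, might look like the sketch below. The helper name and the early directory check are illustrative assumptions; tessdata, language, full and dpi are parameters of get_textpage_ocr(). The folder to pass depends on where Tesseract is installed on the machine.

```python
from pathlib import Path


def ocr_page_text(page, tessdata_dir, language="eng", dpi=150):
    """OCR one page, naming the tessdata folder explicitly in the call."""
    tessdata = Path(tessdata_dir)
    if not tessdata.is_dir():
        # Fail early with a clear message instead of deep inside the OCR call.
        raise FileNotFoundError(f"tessdata folder not found: {tessdata}")
    tp = page.get_textpage_ocr(
        full=True, dpi=dpi, language=language, tessdata=str(tessdata)
    )
    return page.get_text(textpage=tp)
```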

LSUCDS Sep 11, 2024
Author

There's no point in doing this for me. It complicates the installation process of my program if the user has to install Tesseract. I'll simply convert the document to an image and then use PyTesseract directly, so that the user can install everything needed using requirements.txt.

Turns out PyTesseract is simply a wrapper for the Tesseract engine, so it has to be installed either way. Oh well. 😅
