I am unable to extract any type of text from my document #3858

If page.get_text() returns no text (or only whitespace), all you can do is apply a few heuristics to work out why:

  • The page may be covered completely by an image. If this is true, then OCR is an option to check whether text is hiding inside that image.
  • The page may contain so-called annotations only and no other content. To check, see whether page.first_annot is not None (i.e. the page has at least one annotation).
  • The page may contain vector graphics that "simulate" (i.e. look like) text. This is the hardest case - but doesn't happen very often.
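
The three checks above can be sketched roughly as follows. This is only an illustration, not an official recipe: the function name diagnose_empty_page, the returned labels, and the 95% coverage threshold are assumptions made for this example. It relies on page.get_text(), page.get_image_info(), page.first_annot and page.get_drawings().

```python
def diagnose_empty_page(page, coverage_threshold=0.95):
    """Guess why a page yields no text. Labels and threshold are illustrative."""
    if page.get_text().strip():
        return "has_text"

    # Page rectangle as (x0, y0, x1, y1); a PyMuPDF Rect unpacks like a tuple.
    px0, py0, px1, py1 = page.rect
    page_area = (px1 - px0) * (py1 - py0)

    images = page.get_image_info()   # list of dicts, each with a "bbox" entry
    drawings = page.get_drawings()   # list of vector graphics paths

    # Heuristic 1: a single image covers (almost) the whole page -> try OCR.
    for info in images:
        x0, y0, x1, y1 = info["bbox"]
        width = min(x1, px1) - max(x0, px0)    # visible width on the page
        height = min(y1, py1) - max(y0, py0)   # visible height on the page
        if width > 0 and height > 0 and page_area > 0:
            if width * height >= coverage_threshold * page_area:
                return "image_covered"

    # Heuristic 2: annotations exist, but no images or drawings.
    if page.first_annot is not None and not images and not drawings:
        return "annotations_only"

    # Heuristic 3: vector graphics present that may only look like text.
    if drawings:
        return "vector_graphics"

    return "unknown"
```

Call it as diagnose_empty_page(doc[pno]) on a page of an opened document. The result is only a hint: a page with several partial images or a mix of drawings and annotations still needs a manual look.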

Just saw that you determined an image is covering your page. To OCR a page do this:

page = doc[pno]  # load page with 0-based number pno
tp = page.get_textpage_ocr(full=True, dpi=150) #…
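
The snippet above is cut off. One common way to continue, shown here as a minimal sketch and not necessarily what the original answer did, is to pass the OCR text page to page.get_text() through its textpage argument. The helper name ocr_page_text is hypothetical, and OCR in PyMuPDF requires a working Tesseract installation.

```python
def ocr_page_text(page, dpi=150):
    """OCR the full page and return the recognized text (illustrative helper)."""
    # full=True renders and OCRs the whole page, not just its image areas.
    tp = page.get_textpage_ocr(full=True, dpi=dpi)
    # Reuse the OCR result instead of the page's (empty) regular text layer.
    return page.get_text(textpage=tp)
```

A higher dpi usually improves recognition of small print at the cost of speed, so 150 is only a starting point.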

Answer selected by LSUCDS