
Tesseract creates hOCR output without text results #4112

Open
@stweil

Description

On some page images full of text, Tesseract does not detect any text when using the default settings. Typically it prints "Empty page!!" twice for such pages. See issue #3021 for details and examples.

In some rare cases Tesseract prints "Empty page!!" only once and finds text in a second pass. That text is written to the ALTO and text output, but the hOCR output does not contain it.

Example:

tesseract https://digi.bib.uni-mannheim.de/periodika/fileadmin/data/DeutReunP_856399094_19140210/max/856399094_1910_035_03.jpg 856399094_1910_035_03 alto hocr txt
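A quick way to confirm the mismatch after running the command is to compare the word count of the hOCR output with that of the text output. The following is only a minimal sketch, not part of Tesseract: the helper names (HocrWordCounter, count_hocr_words, hocr_missing_text) are hypothetical, and it assumes that recognized words appear in hOCR as elements whose class attribute includes ocrx_word and that the output files are named after the output base with the extensions .hocr and .txt.

```python
import sys
from html.parser import HTMLParser


class HocrWordCounter(HTMLParser):
    """Count non-empty elements whose class list includes 'ocrx_word'.

    Assumption: hOCR marks each recognized word with class 'ocrx_word'.
    """

    def __init__(self):
        super().__init__()
        self.words = 0
        self._tag = None    # tag name of the word element currently open
        self._depth = 0     # nesting depth of that tag name
        self._buffer = []   # text collected inside the current word element

    def handle_starttag(self, tag, attrs):
        if self._tag is not None:
            # Already inside a word element: only track same-name nesting.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocrx_word" in classes:
            self._tag, self._depth, self._buffer = tag, 1, []

    def handle_endtag(self, tag):
        if self._tag is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            # Count the word only if it actually contains text.
            if "".join(self._buffer).strip():
                self.words += 1
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)


def count_hocr_words(hocr_text):
    """Return the number of non-empty word elements in an hOCR document."""
    parser = HocrWordCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def hocr_missing_text(hocr_text, plain_text):
    """True if the text output has words but the hOCR output has none."""
    return count_hocr_words(hocr_text) == 0 and len(plain_text.split()) > 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_hocr.py OUTPUT.hocr OUTPUT.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        hocr = f.read()
    with open(sys.argv[2], encoding="utf-8") as f:
        text = f.read()
    print("hOCR words:", count_hocr_words(hocr))
    print("text words:", len(text.split()))
    print("hOCR is missing text:", hocr_missing_text(hocr, text))
```

For the example above, the script would be called with 856399094_1910_035_03.hocr and 856399094_1910_035_03.txt. If the behaviour described here occurs, the text word count should be greater than zero while the hOCR word count is zero.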


Labels: bug, output (issues related output formats)
