
Noise characters recognized with bbox as the entire page #1192

@TerryZH

Description

Environment

  • Tesseract Version: v4.00.00dev-692-gad5ee18 with Leptonica
  • Commit Number: ad5ee18
  • Platform: macOS, Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Current Behavior:

On line 1, an unexpected '__' is recognized between "1941" and "Ritter", and its bbox is reported as the entire page.

sample
Corresponding HOCR line:
GS 1 2,261,002 Oct. 28,1941 __ Ritter 760 $FO

Expected Behavior:

'__' should not be recognized in the first place. If this false positive is unavoidable, its bbox should at least be accurate.
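
Until this is addressed, a downstream check can flag such words. The following is only a rough sketch of a workaround, not a fix in Tesseract itself. It assumes the hOCR output uses the standard `ocr_page` and `ocrx_word` classes with a `bbox x0 y0 x1 y1` entry in the `title` attribute. The helper names, the 0.9 area threshold, and the sample coordinates are all made up for illustration and are not taken from the actual output of this issue.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collect the page bbox and a [text, bbox] pair for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.page_bbox = None
        self.words = []
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            self.page_bbox = bbox
        elif "ocrx_word" in classes:
            self._in_word = True
            self.words.append(["", bbox])

    def handle_data(self, data):
        if self._in_word:
            self.words[-1][0] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False


def page_sized_words(hocr, threshold=0.9):
    """Return (text, bbox) for words whose bbox covers >= threshold of the page area."""
    parser = HocrBoxes()
    parser.feed(hocr)
    if parser.page_bbox is None:
        return []
    px0, py0, px1, py1 = parser.page_bbox
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        return []
    flagged = []
    for text, (x0, y0, x1, y1) in parser.words:
        if (x1 - x0) * (y1 - y0) >= threshold * page_area:
            flagged.append((text.strip(), (x0, y0, x1, y1)))
    return flagged


# Hypothetical hOCR fragment; the coordinates are invented for this example.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 1000 1500; ppageno 0'>
  <span class='ocr_line' id='line_1_1' title='bbox 50 100 950 130'>
    <span class='ocrx_word' id='word_1_1' title='bbox 500 100 580 130; x_wconf 90'>28,1941</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 0 0 1000 1500; x_wconf 30'>__</span>
    <span class='ocrx_word' id='word_1_3' title='bbox 600 100 700 130; x_wconf 92'>Ritter</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in page_sized_words(SAMPLE_HOCR):
        print(f"suspicious word {text!r} with bbox {bbox}")
```

A check like this only hides the symptom on the consumer side; the spurious word and its page-sized bbox would still be present in the hOCR output.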

Suggested Fix:

n/a
