Italic info in hocr output #1371

Closed
niksedk opened this issue Mar 11, 2018 · 4 comments

niksedk commented Mar 11, 2018

I cannot find any italic info in the Tesseract 4.00.00alpha hOCR output.

Tesseract 3.x included this info via the `<em>` tag.

It would be very helpful if this could be added again in some way.
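
For readers who want to check whether a given hOCR file carries this information, here is a minimal sketch that collects the text wrapped in `em` tags. It assumes the italic markup is a plain `<em>` element nested inside the word span; the `SAMPLE` string is a hand-written illustration, not output captured from Tesseract, and the names `ItalicWordCollector` and `italic_words` are hypothetical helpers.

```python
from html.parser import HTMLParser


class ItalicWordCollector(HTMLParser):
    """Collects text that appears inside <em> elements."""

    def __init__(self):
        super().__init__()
        self._em_depth = 0        # > 0 while we are inside an <em> element
        self.italic_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self._em_depth += 1

    def handle_endtag(self, tag):
        if tag == "em" and self._em_depth > 0:
            self._em_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._em_depth > 0 and text:
            self.italic_words.append(text)


def italic_words(hocr):
    """Return the list of text fragments marked up with <em> in an hOCR string."""
    parser = ItalicWordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.italic_words


# Hand-written sample (an assumption about the markup shape, not real Tesseract output).
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' title='bbox 10 10 50 30'>plain</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 60 10 120 30'><em>slanted</em></span>"
)

print(italic_words(SAMPLE))
```

An empty result on a real file would be consistent with the missing italic info described above, though it could also simply mean the page contains no italic text.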

Member

stweil commented Mar 11, 2018

That's a missing feature of the new LSTM engine: it does not support font attributes such as bold or italic.

Tesseract 4 still supports the old OCR engine as long as you use traineddata files which include the necessary information. The files from https://github.com/tesseract-ocr/tessdata will work for you.
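
As a rough sketch of that workaround, the snippet below builds a command line that selects the legacy engine with `--oem 0` and requests hOCR output. The file names, the tessdata path, and the helper names `legacy_hocr_command` and `run_legacy_hocr` are placeholders chosen for illustration; the directory is assumed to contain traineddata files from the tessdata repository linked above.

```python
import subprocess


def legacy_hocr_command(image, output_base, tessdata_dir, lang="eng"):
    """Build a tesseract command that uses the legacy engine and writes hOCR."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,  # placeholder: directory with tessdata files
        "--oem", "0",                    # 0 = legacy engine only
        "-l", lang,
        "hocr",                          # config file that enables hOCR output
    ]


def run_legacy_hocr(image, output_base, tessdata_dir, lang="eng"):
    """Run the command; requires tesseract to be installed and on PATH."""
    subprocess.run(
        legacy_hocr_command(image, output_base, tessdata_dir, lang),
        check=True,
    )


# Placeholder paths, shown only to illustrate the resulting command line.
cmd = legacy_hocr_command("page.png", "page", "/path/to/tessdata")
print(" ".join(cmd))
```

Whether the resulting hOCR file actually contains the italic markup depends on the traineddata in use, so the output is worth inspecting before relying on it.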

Author

niksedk commented Mar 11, 2018

@stweil: thx for the info :)
Is it likely this feature will be added to the new LSTM engine at some point?

Member

stweil commented Mar 11, 2018

Who knows? I don't – maybe @theraysmith has plans to enhance the LSTM engine in that direction.

Collaborator

amitdo commented Mar 11, 2018

#1074 (comment)

@zdenop zdenop closed this as completed Mar 27, 2018