Lstmbox #2216

Merged
merged 4 commits into from
Feb 10, 2019

Conversation

Shreeshrii
Collaborator

@Shreeshrii Shreeshrii commented Feb 2, 2019

Create box files (using code similar to the TSV renderer) in the format needed for LSTM training, i.e. with a row for the space after every word and a row with a tab to mark the end of each line.

@Shreeshrii
Collaborator Author

I have tested this for eng and hin. It will require testing for RTL languages.

Since the character-level bounding boxes are NOT accurate with the LSTM engine, I changed the format to the one used by ocrd-train, i.e. every character on a line carries the bounding box of its whole text line.
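
To make the line-level format concrete, here is a minimal Python sketch. It is not the code in src/api/lstmboxrenderer.cpp. It assumes the usual box-file column order `<symbol> <left> <bottom> <right> <top> <page>`; the function name `lstm_box_rows` and the sample coordinates are hypothetical. Every character reuses the text line's box, a space row follows each word, and a tab row closes the line.

```python
from typing import List, Tuple

# (left, bottom, right, top): assumed box-file column order
Box = Tuple[int, int, int, int]


def lstm_box_rows(words: List[str], line_box: Box, page: int = 0) -> List[str]:
    """Build box rows for one text line, reusing the line's box for every row."""
    coords = " ".join(str(v) for v in line_box)
    rows = []
    for word in words:
        for ch in word:
            rows.append(f"{ch} {coords} {page}")
        # A row whose symbol is a single space follows every word.
        rows.append(f"  {coords} {page}")
    # A row whose symbol is a tab marks the end of the text line.
    rows.append(f"\t {coords} {page}")
    return rows


rows = lstm_box_rows(["to", "be"], (36, 92, 580, 122))
print("\n".join(repr(r) for r in rows))
```

Printing with `repr` makes the space and tab symbols visible, which is handy when checking such a file by eye.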

@zdenop
Contributor

zdenop commented Feb 2, 2019

Sorry if I missed some information, but what is the use case for this?

@Shreeshrii
Collaborator Author

When tesseract is used with makebox to create box files from scanned images, the format is suitable for training the legacy Tesseract engine (--oem 0), but it lacks the rows for spaces between words and the tab rows that mark the end of each line, both of which LSTM training needs.

This PR allows creation of box files from images, in the format needed by LSTM training. The box files will still need to be edited for accuracy (similar to tesseract 3).

This can be useful when someone wants to finetune for a particular typeface which is not available as a font.
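
For illustration, the two invocations being contrasted might be built as below. This is a sketch under assumptions: `makebox` is the existing config for legacy box output, while the config name `lstmbox` for the new renderer is assumed from the PR title and is not stated in this thread. The helper name `box_command` and the file names are placeholders.

```python
from typing import List


def box_command(image: str, out_base: str, lang: str, lstm: bool) -> List[str]:
    """Return the tesseract argv for legacy (makebox) or LSTM-style box output."""
    # "lstmbox" is an assumed config name for the renderer added in this PR.
    config = "lstmbox" if lstm else "makebox"
    return ["tesseract", image, out_base, "-l", lang, config]


legacy_cmd = box_command("page.tif", "page", "eng", lstm=False)
lstm_cmd = box_command("page.tif", "page", "eng", lstm=True)
print(" ".join(legacy_cmd))
print(" ".join(lstm_cmd))
# To actually run one of them: subprocess.run(lstm_cmd, check=True)
```

Either way, the resulting box file still has to be reviewed and corrected by hand before it is used for training.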

@Shreeshrii
Collaborator Author

Since there are lots of requests by users who want to train using images, I wanted to add this option.
The code in src/api/lstmboxrenderer.cpp can be streamlined, but I do not know C++ well enough to do that.
@stweil Please review and fix. Thanks.

Review thread on src/api/renderer.h (outdated, resolved)
@Shreeshrii
Collaborator Author

testbox.zip

Test results for eng, hin, ara, chi_sim and chi_tra

Both chi_sim and chi_tra include extra spaces when a word contains both Chinese and Latin script.
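
One hypothetical way to find such rows for manual review is sketched below in Python. It assumes each box row ends with five numeric fields (left, bottom, right, top, page) and flags space rows that sit between a CJK ideograph and an ASCII letter. The helper names are illustrative, the CJK test covers only the basic U+4E00..U+9FFF block, and a flagged row is a candidate, not a confirmed error.

```python
from typing import List


def symbol_of(row: str) -> str:
    """Return the symbol column; the last five fields are numeric."""
    return row.rsplit(" ", 5)[0]


def is_cjk(sym: str) -> bool:
    """True for a single character in CJK Unified Ideographs (U+4E00..U+9FFF)."""
    return len(sym) == 1 and "\u4e00" <= sym <= "\u9fff"


def is_latin(sym: str) -> bool:
    """True for a single ASCII letter."""
    return len(sym) == 1 and sym.isascii() and sym.isalpha()


def mixed_script_spaces(rows: List[str]) -> List[int]:
    """Return indexes of space rows lying between a CJK and a Latin symbol."""
    symbols = [symbol_of(r) for r in rows]
    hits = []
    for i in range(1, len(symbols) - 1):
        if symbols[i] != " ":
            continue
        before, after = symbols[i - 1], symbols[i + 1]
        if (is_cjk(before) and is_latin(after)) or (is_latin(before) and is_cjk(after)):
            hits.append(i)
    return hits


sample = [
    "中 10 20 90 40 0",
    "  10 20 90 40 0",
    "A 10 20 90 40 0",
    "B 10 20 90 40 0",
    "\t 10 20 90 40 0",
]
suspects = mixed_script_spaces(sample)
print(suspects)
```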

(cherry picked from commit 921da6b)

fix typo

(cherry picked from commit 7bd1a0c)

Add lstmboxrenderer to CMakeLists

(cherry picked from commit cfef3a8)

fix formatting

(cherry picked from commit 7ba2b01)
@Shreeshrii Shreeshrii closed this Feb 5, 2019
@Shreeshrii Shreeshrii deleted the lstmbox branch February 5, 2019 15:54
@Shreeshrii Shreeshrii restored the lstmbox branch February 5, 2019 15:55
@Shreeshrii Shreeshrii reopened this Feb 5, 2019
@amitdo
Collaborator

amitdo commented Feb 7, 2019

text2image does not output space char at EOL.
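
If a box file already contains a space row before each end-of-line tab row, a post-processing step along the following lines could bring it in line with text2image. This is only a sketch, assuming each row ends with five numeric fields (left, bottom, right, top, page); the function names are hypothetical and this is not the change made in the renderer itself.

```python
from typing import List


def symbol_of(row: str) -> str:
    """Return the symbol column; the last five fields are numeric."""
    return row.rsplit(" ", 5)[0]


def drop_space_before_eol(rows: List[str]) -> List[str]:
    """Remove a space row that sits directly before a tab (end-of-line) row."""
    cleaned: List[str] = []
    for row in rows:
        if symbol_of(row) == "\t" and cleaned and symbol_of(cleaned[-1]) == " ":
            cleaned.pop()
        cleaned.append(row)
    return cleaned


sample = [
    "a 10 20 90 40 0",
    "  10 20 90 40 0",
    "b 10 20 90 40 0",
    "  10 20 90 40 0",
    "\t 10 20 90 40 0",
]
cleaned = drop_space_before_eol(sample)
print([symbol_of(r) for r in cleaned])
```

Spaces between words are left untouched; only the one immediately preceding the tab row is dropped.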

@Shreeshrii
Collaborator Author

@amitdo Thanks. Will change.

Have you checked the output for RTL languages, e.g. Hebrew?

@amitdo
Collaborator

amitdo commented Feb 8, 2019

The ara lstmbox in testbox.zip looks fine.

@zdenop zdenop merged commit 2ae65b2 into tesseract-ocr:master Feb 10, 2019
@Shreeshrii Shreeshrii deleted the lstmbox branch February 20, 2019 11:12
@amitdo amitdo added the output (issues related output formats), RTL, enhancement, and text2image labels Sep 26, 2022
4 participants