Lstmbox #2216

Shreeshrii · 2019-02-02T11:48:33Z

Create box files (using code similar to the TSV renderer) in the format needed for LSTM training, i.e. with a line for the space after every word and a line with a tab to mark the end of each text line.

Shreeshrii · 2019-02-02T12:52:52Z

I have tested this for eng and hin. It will require testing for RTL languages.

Since the character-level bounding boxes are NOT accurate with the LSTM engine, I changed the format to the one used by ocrd-train, i.e. bounding box info at TEXTLINE level for all characters on a line.
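To make the format described above concrete, here is a minimal Python sketch of how such box lines could be produced for one text line. It is not the code in src/api/lstmboxrenderer.cpp; the function name `lstm_box_lines`, the `BBox` tuple layout and the choice to reuse the TEXTLINE coordinates for the tab marker are assumptions made for illustration. It follows the usual box-file layout (symbol, left, bottom, right, top, page) and emits a space only between words, not after the last one, in line with the text2image behaviour noted later in this thread.

```python
from typing import List, Tuple

# Hypothetical layout: (left, bottom, right, top), origin at the bottom-left.
BBox = Tuple[int, int, int, int]


def lstm_box_lines(words: List[str], line_bbox: BBox, page: int = 0) -> List[str]:
    """Return box-file lines for one text line.

    Every symbol carries the bounding box of the whole TEXTLINE.
    Simplification: iterates over code points, so combining marks
    (e.g. in Devanagari) end up on their own lines.
    """
    left, bottom, right, top = line_bbox
    coords = f"{left} {bottom} {right} {top} {page}"
    out: List[str] = []
    for i, word in enumerate(words):
        if i > 0:
            # The symbol is a space, followed by the usual field separator.
            out.append(f"  {coords}")
        for ch in word:
            out.append(f"{ch} {coords}")
    # A tab symbol marks the end of the text line (coordinates are an assumption).
    out.append(f"\t {coords}")
    return out


if __name__ == "__main__":
    for box_line in lstm_box_lines(["ab", "c"], (10, 20, 110, 50)):
        print(repr(box_line))
```

Calling `lstm_box_lines(["ab", "c"], (10, 20, 110, 50))` yields five lines: two for `a` and `b`, one for the space, one for `c`, and one for the tab.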

zdenop · 2019-02-02T19:41:52Z

Sorry if I missed some information, but what is the use case for this?

Shreeshrii · 2019-02-02T19:57:32Z

When tesseract is used with makebox to create box files from scanned images, the format is suitable for training the base tesseract engine (--oem 0), but it lacks the boxes for spaces between words and the tabs marking end of line that LSTM training needs.

This PR allows creation of box files from images, in the format needed by LSTM training. The box files will still need to be edited for accuracy (similar to tesseract 3).

This can be useful when someone wants to finetune for a particular typeface which is not available as a font.
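Because the generated box files still have to be corrected by hand, a quick consistency check after editing may help. The following Python sketch reads box-file text in the layout discussed in this PR (symbol followed by five integer fields) and rebuilds the text of each line, failing if a line is malformed or the final text line has no tab marker. The function name `read_lstm_box` is hypothetical and this is not part of tesseract.

```python
from typing import List


def read_lstm_box(box_text: str) -> List[str]:
    """Rebuild the text lines encoded in LSTM-style box-file content.

    Raises ValueError on malformed lines or a missing end-of-line tab.
    """
    text_lines: List[str] = []
    current: List[str] = []
    for raw in box_text.split("\n"):
        if not raw:
            continue  # skip empty lines, e.g. after the final newline
        # Split from the right so a space or tab symbol stays intact.
        parts = raw.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"malformed box line: {raw!r}")
        symbol = parts[0]
        for field in parts[1:]:
            int(field)  # raises ValueError if a coordinate is not an integer
        if symbol == "\t":
            text_lines.append("".join(current))
            current = []
        else:
            current.append(symbol)
    if current:
        raise ValueError("last text line has no tab end-of-line marker")
    return text_lines


if __name__ == "__main__":
    sample = "a 1 2 3 4 0\n  1 2 3 4 0\nb 1 2 3 4 0\n\t 1 2 3 4 0\n"
    print(read_lstm_box(sample))
```

Comparing the returned strings with the intended ground-truth text is one simple way to spot a missing space or tab line before training.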

Shreeshrii · 2019-02-03T04:10:19Z

Since there are lots of requests by users who want to train using images, I wanted to add this option.
The code in src/api/lstmboxrenderer.cpp can be streamlined. I do not know C++ well enough to do that.
@stweil Please review and fix. Thanks.

Shreeshrii · 2019-02-03T15:51:57Z

testbox.zip

Test results for eng, hin, ara, chi_sim and chi_tra

Both chi_sim and chi_tra include extra spaces when a word contains both Chinese and Latin script.

amitdo · 2019-02-07T17:35:57Z

text2image does not output a space character at EOL.

Shreeshrii · 2019-02-08T03:15:57Z

@amitdo Thanks. Will change.

Have you checked the output for RTL, e.g. Hebrew?

amitdo · 2019-02-08T05:58:56Z

The ara lstmbox in testbox.zip looks fine.

stweil reviewed Feb 3, 2019

src/api/renderer.h (outdated)

Shreeshrii added 2 commits February 5, 2019 14:03

Add a new renderer to create box files from images for LSTM training

9c89cd5

(cherry picked from commit 921da6b)
fix typo (cherry picked from commit 7bd1a0c)
Add lstmboxrenderer to CMakeLists (cherry picked from commit cfef3a8)
fix formatting (cherry picked from commit 7ba2b01)

change to use bbox coordinates for TEXTLINE for all characters

0f42fd8

(cherry picked from commit 049db10)

Shreeshrii closed this Feb 5, 2019

Shreeshrii deleted the lstmbox branch February 5, 2019 15:54

Shreeshrii restored the lstmbox branch February 5, 2019 15:55

Shreeshrii reopened this Feb 5, 2019

Shreeshrii added 2 commits February 10, 2019 05:13

change to const char* as suggested by @stweil

b51c1bf

put common code in AddBoxToLSTM

3110536

zdenop merged commit 2ae65b2 into tesseract-ocr:master Feb 10, 2019

Shreeshrii deleted the lstmbox branch February 20, 2019 11:12

amitdo added the labels output (issues related output formats), RTL, enhancement, text2image on Sep 26, 2022
