Multiprocess 4.00.00alpha way slower than 3.03

Hi,

I need to do OCR on a lot of multipage TIF documents. After reading https://github.com/tesseract-ocr/tesseract/issues/263#issuecomment-294275453 I decided to run several Tesseract processes in parallel.
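A minimal sketch of this kind of parallel setup, for illustration only. It assumes the `tesseract` binary is on `PATH` and uses the standard `tesseract <image> <outputbase>` invocation; the helper names `build_command` and `ocr_all` and the default of 4 processes are hypothetical, not taken from the linked comment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(tif_path, out_dir):
    """Return the argv for one Tesseract run: tesseract <image> <outputbase>."""
    tif_path = Path(tif_path)
    # Tesseract appends the output extension (.txt by default) itself.
    out_base = Path(out_dir) / tif_path.stem
    return ["tesseract", str(tif_path), str(out_base)]


def ocr_all(tif_paths, out_dir, processes=4):
    """Run up to `processes` Tesseract processes at once; return exit codes."""
    def run_one(path):
        return subprocess.run(build_command(path, out_dir)).returncode

    # The Python threads only wait on child processes; the OCR work itself
    # happens in the separate tesseract processes.
    with ThreadPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(run_one, tif_paths))
```

Calling `ocr_all(list_of_tifs, "out", processes=4)` would keep four Tesseract processes busy until the list is exhausted.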

With Tesseract 3.03, OCR speed increases more or less linearly with the number of processes. With 4.00.00alpha, however, all processes stall on the first page, which seems to take forever to finish. If I manually pause one process, the others are able to resume processing.

The problem seems to be that v4.00 uses up to 4 CPUs to process a multipage TIF (one is saturated and the other 3 are used at about 25%). So if you run 4 processes in parallel on a 4-CPU machine, they get stuck. That would also explain why launching two processes in parallel on an 8-CPU machine is OK but launching 8 is extremely slow.
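The arithmetic behind this suspicion can be sketched as follows. This is an illustration under assumptions, not a confirmed diagnosis: it assumes each 4.00 process may run up to 4 threads, as observed above, and that those extra threads come from OpenMP, which reads the standard `OMP_THREAD_LIMIT` environment variable. The helper names are hypothetical.

```python
import os


def thread_demand(processes, threads_per_process):
    """Worst-case number of runnable threads across all OCR processes."""
    return processes * threads_per_process


def is_oversubscribed(processes, threads_per_process, cpus):
    """True when the processes may want more threads than there are CPUs."""
    return thread_demand(processes, threads_per_process) > cpus


def capped_env(max_threads=1, base_env=None):
    """Copy the environment and cap OpenMP threads for a child process.

    Assumption: the extra threads are OpenMP threads, so the standard
    OMP_THREAD_LIMIT variable applies to them.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(max_threads)
    return env


# The three configurations described above, assuming 4 threads per process.
print(is_oversubscribed(4, 4, cpus=4))  # 4 processes on a 4-CPU machine
print(is_oversubscribed(2, 4, cpus=8))  # 2 processes on an 8-CPU machine
print(is_oversubscribed(8, 4, cpus=8))  # 8 processes on an 8-CPU machine
```

If the assumption holds, passing `env=capped_env(1)` to `subprocess.run` when launching each Tesseract process would be one way to test whether single-threaded processes scale the way 3.03 does.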

I got the same problem on Ubuntu 14.04.5 LTS and Amazon Linux AMI 2016.09.

Is this a bug in the alpha version? Or is it a feature meant to speed up the processing of multipage TIF images?

Thanks for any help you can provide.

---------

tesseract 3.05.00 ( 2ca5d0a ) is OK

Multiprocess 4.00.00alpha way slower than 3.03 #898
