RFC: Tesseract performance #943

Open
stweil opened this issue May 22, 2017 · 23 comments

stweil (Contributor) commented May 22, 2017

Performance is important for real-time OCR, mass-production OCR, and training.

In this RFC I'd like to discuss performance bottlenecks and potential improvements.

See also the Tesseract wiki.

According to my tests with Valgrind's tool callgrind, these functions have the largest computational costs (ordered by decreasing time):

  • memory allocations / deallocations
  • tesseract::UnicharIdArrayUtils::compare
  • tesseract::DotProductAVX (or the other implementations of the dot product)
  • vfprintf (called from snprintf)

tesseract::UnicharIdArrayUtils::compare and memory allocations / deallocations are also the functions which are called most often.

I recently had a closer look at the dot product calculations and noticed that at least some input vectors are converted from float to double (which takes time). The dot product is always computed with double values, which is more expensive than float. If memory bandwidth is the limiting factor, using double means doubled time compared with float. The current code uses 4 parallel threads. I ran some timing tests without that parallelization and got nearly the same execution time. @theraysmith, did you try using float for the dot product, and do you get better performance from parallelization in that part of the OCR process?
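
For illustration, a minimal scalar sketch of the two variants (an assumed example to show the bandwidth argument, not the actual tesseract::DotProduct* implementations):

// Hedged sketch only, not the real Tesseract code.
// With double, each element occupies 8 bytes instead of 4, so a
// memory-bound dot product has to move twice as much data.
double DotProductDouble(const double* u, const double* v, int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

// Single-precision variant: half the memory traffic, but a less
// precise accumulation.
float DotProductFloat(const float* u, const float* v, int n) {
  float total = 0.0f;
  for (int i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}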

theraysmith (Contributor) commented May 22, 2017 via email

stweil (Contributor, Author) commented May 23, 2017

Valgrind shows all callers, so I can provide that information (next week, as I'm currently busy with other things).

Regarding the precision of the dot product: the addition is the critical part for accuracy. Did you ever try one of the algorithms that improve that part, e.g. Kahan summation? Maybe that would be sufficient to allow using float everywhere.
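
For illustration, a minimal Kahan summation sketch in single precision (an assumed example, not code from Tesseract): the compensation term recovers the low-order bits that a plain float accumulator would lose.

// Hedged sketch only: compensated (Kahan) summation for a float dot product.
float DotProductKahan(const float* u, const float* v, int n) {
  float sum = 0.0f;
  float c = 0.0f;  // running compensation for lost low-order bits
  for (int i = 0; i < n; ++i) {
    float y = u[i] * v[i] - c;
    float t = sum + y;
    c = (t - sum) - y;  // the part of y that did not make it into sum
    sum = t;
  }
  return sum;
}

Note that aggressive floating-point optimizations such as -ffast-math can eliminate the compensation term, so such code needs to be compiled with care.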

I built with OpenMP and disabled it only in lstm/fullyconnected.cpp and lstm/weightmatrix.cpp. The original code of the master branch shows that OpenMP works, as the real time is much less than the user time:

real    2m50,958s
user    7m39,712s
sys     0m2,128s

With OpenMP disabled for the parts mentioned above, OpenMP still works, the real time increases moderately while the user time decreases:

real    3m9,378s
user    7m33,084s
sys     0m2,092s
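
For context, a hedged sketch of the kind of OpenMP parallelization involved here (a matrix-vector product split across a few threads); this is illustrative only, not the actual code in lstm/weightmatrix.cpp:

// Hedged sketch: each output row is independent, so the outer loop can be
// distributed with a single pragma. Without OpenMP this is a plain serial loop.
void MatrixTimesVector(const double* w, const double* u,
                       double* result, int num_out, int num_in) {
#ifdef _OPENMP
#pragma omp parallel for num_threads(4)
#endif
  for (int row = 0; row < num_out; ++row) {
    const double* w_row = w + row * num_in;
    double total = 0.0;
    for (int col = 0; col < num_in; ++col) total += w_row[col] * u[col];
    result[row] = total;
  }
}

The implicit barrier at the end of each such loop is where the threads wait for each other; disabling the pragma removes that overhead, which matches the numbers above: real time rises only moderately while total user time drops.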

stweil (Contributor, Author) commented May 25, 2017

Stack for tesseract::UnicharIdArrayUtils::compare (called 2651872 times for a small "hello world" image):

tesseract::UnicharIdArrayUtils::compare
ELIST::add_sorted_and_find
ELIST::add_sorted
tesseract::UnicharAmbigs::InsertIntoTable
tesseract::UnicharAmbigs::LoadUnicharAmbigs
...

Shreeshrii (Collaborator) commented:

Is this for --oem 0 or --oem 1?

I thought that Unicharambigs is not used by the LSTM engine...

stweil (Contributor, Author) commented May 25, 2017

It is for LSTM:

valgrind --tool=callgrind --dump-line=yes --verbose api/tesseract --oem 1 hello.png /tmp/hello

The output is a large file called something like callgrind.out.1234 (replace 1234 with the process id).
I suggest using kcachegrind to view the results.

Shreeshrii (Collaborator) commented:

@stweil

Is that with debug? What about without it?

stweil (Contributor, Author) commented May 25, 2017

No, it was built with ./configure --disable-shared --disable-static 'CXXFLAGS=-Wall -g -O2'. I usually disable the library builds because they take additional time.

stweil (Contributor, Author) commented May 25, 2017

My first results reported above were from a very small image (a single-line "hello world"), so initialization contributes significantly.

With a really large image (a newspaper page), the result changes and tesseract::DotProductAVX is the dominating element. Surprisingly, it is followed by gomp_team_barrier_wait_end and gomp_barrier_wait_end, which according to Valgrind use nearly as much time as the dot product. Those two functions are part of OpenMP.

See also issue #898.

stweil (Contributor, Author) commented May 26, 2017

Unrolling the loop in tesseract::DotProductAVX results in nearly a 7 % improvement (a sketch of such an unrolled loop follows the timings below):

# tesseract 0604.jp2 /tmp/0604 # git master, without OpenMP
real	2m54,469s
user	2m54,160s
sys	0m0,304s

# same test, but with improved tesseract::DotProductAVX
real	2m41,855s
user	2m41,576s
sys	0m0,272s
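
A hedged sketch of what such an unrolled AVX loop can look like (an assumed example, not the actual tesseract::DotProductAVX code); two independent accumulators let the additions overlap:

#include <immintrin.h>

// Hedged sketch: unrolled AVX dot product in double precision.
double DotProductUnrolled(const double* u, const double* v, int n) {
  __m256d sum0 = _mm256_setzero_pd();
  __m256d sum1 = _mm256_setzero_pd();
  int i = 0;
  // Process 8 elements per iteration with two independent accumulators.
  for (; i + 8 <= n; i += 8) {
    sum0 = _mm256_add_pd(sum0, _mm256_mul_pd(_mm256_loadu_pd(u + i),
                                             _mm256_loadu_pd(v + i)));
    sum1 = _mm256_add_pd(sum1, _mm256_mul_pd(_mm256_loadu_pd(u + i + 4),
                                             _mm256_loadu_pd(v + i + 4)));
  }
  sum0 = _mm256_add_pd(sum0, sum1);
  double buf[4];
  _mm256_storeu_pd(buf, sum0);
  double result = buf[0] + buf[1] + buf[2] + buf[3];
  for (; i < n; ++i) result += u[i] * v[i];  // scalar tail
  return result;
}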

stweil (Contributor, Author) commented May 26, 2017

Latest numbers based on code from FAU Erlangen (thanks to @moebiusband73), Kahan part removed:

real	2m31,514s
user	2m31,220s
sys	0m0,280s

That is an improvement of 12 %. Assembler code could improve it further, but I'd expect the largest improvement from using float instead of double (compensating for the loss of precision with the Kahan algorithm).

mlissner commented:

The conversation here is largely over my head, but I came to the bug tracker to discuss performance in 4.0, and this bug is titled "RFC: Tesseract Performance" so it seems like the right place. (Apologies if I'm wrong.)

Simple question: On the "Neural nets In Tesseract" wiki page, it says:

On a machine with multiple cores, and AVX, an easy English image may take twice as much real time, and use 7 times the CPU as base Tesseract

But above (and elsewhere) says:

I have had some very good results in this area, with a network 3x faster than the legacy code (for English)

I do a lot of batch OCR using Tesseract. Can I expect 4.0 to be faster by 3× or slower by 7×? If the latter, that'll be something that we'll need to plan on, since our current infrastructure took over a month to complete the last batch OCR job using 3.02. If we need 7× more servers that's a huge deal — I don't know that we'd ever upgrade if that was the case. If it's 3× faster, that's incredible.

(Sorry again if this is the wrong place to bring this up. I'm trying to get a grasp on this situation. Thank you all for all your great work.)

stweil (Contributor, Author) commented May 30, 2017

The current 4.0 still supports the "old" OCR recognizer and is comparable to 3.05 if that is used (command line argument --oem 0), but there are plans to remove this feature from 4.0.

4.0 supports a new recognizer based on LSTM (--oem 1). The new LSTM code uses parallel processing, so OCR can finish faster, but it needs more total CPU time. As 4.0 is still experimental, all timing results can change. Ray announced new trained language models for 4.0 with a smaller neural network. If that works out, it could reduce CPU time in comparison to 3.05.

Currently you won't reduce the infrastructure requirements with 4.0.

mlissner commented:

That sounds promising, thanks @stweil. I'm perfectly happy not reducing infrastructure requirements, but increasing them 7× would be a very big deal.

FWIW, decreasing wall time via parallelization while increasing CPU time sounds good on paper, but unless I'm missing something (totally possible), it doesn't mean much for batch jobs like ours since we run 24 OCR processes in parallel already. Our server looks like this when doing a job:

[screenshot from 2017-05-10: server load during a batch OCR job]

stweil (Contributor, Author) commented May 31, 2017

For your use case (which is similar to my own) I'd compile Tesseract without OpenMP support. Otherwise the parallelization wastes a significant part of the CPU performance.

amitdo (Collaborator) commented May 31, 2017

What's the method you use to disable OpenMP?

Commenting out AC_OPENMP, or something else?

stweil (Contributor, Author) commented May 31, 2017

configure --disable-openmp

amitdo (Collaborator) commented Jun 1, 2017

--disable-shared --disable-static seems to be equivalent to just --disable-shared.

Shreeshrii (Collaborator) commented:

Please see #961 (comment) for another example of a decrease in user time with --disable-openmp.

By disabling OpenMP, the user time dropped to almost one third on WSL on my AMD Windows 10 PC.

amitdo (Collaborator) commented Jul 12, 2019

> Regarding the precision of the dot product: the addition is the critical part for accuracy. Did you ever try one of the algorithms that improve that part, e.g. Kahan summation? Maybe that would be sufficient to allow using float everywhere.

With some of NVIDIA's cards it's now possible to run the dot product on fp16 vectors:
sum_fp32 += u_fp16 * v_fp16

Google's TPUs also have something similar, with bfloat16 instead of the standard fp16.
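
As a CPU-side illustration of that mixed-precision idea (a hedged sketch using the x86 F16C conversion intrinsics, not code that exists in Tesseract and not the GPU/TPU path): the inputs stay in fp16 while the accumulation happens in fp32.

#include <immintrin.h>
#include <cstdint>

// Hedged sketch: fp16 inputs, fp32 accumulation (sum_fp32 += u_fp16 * v_fp16).
// Requires AVX and F16C support (e.g. compile with -mavx -mf16c).
float DotProductF16(const uint16_t* u, const uint16_t* v, int n) {
  __m256 sum = _mm256_setzero_ps();
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    // Load 8 half-precision values and widen them to fp32.
    __m256 uf = _mm256_cvtph_ps(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(u + i)));
    __m256 vf = _mm256_cvtph_ps(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(v + i)));
    sum = _mm256_add_ps(sum, _mm256_mul_ps(uf, vf));
  }
  float buf[8];
  _mm256_storeu_ps(buf, sum);
  float result = buf[0] + buf[1] + buf[2] + buf[3] +
                 buf[4] + buf[5] + buf[6] + buf[7];
  for (; i < n; ++i) result += _cvtsh_ss(u[i]) * _cvtsh_ss(v[i]);  // scalar tail
  return result;
}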

amitdo added the RFC label Mar 21, 2021
Mark-Joy commented Nov 7, 2023

Simply put, the LSTM model (oem=1) only produces a faster execution time than the legacy model (oem=0) if the number of images is less than (the machine's maximum number of threads / 4).

Example:
For an ordinary computer whose maximum number of threads is 24, the LSTM model gives a faster execution time when processing a PDF with 6 pages or fewer (assuming 1 page is 1 image).

Mark-Joy commented Nov 7, 2023

> The current 4.0 still supports the "old" OCR recognizer and is comparable to 3.05 if that is used (command line argument --oem 0), but there are plans to remove this feature from 4.0.

Please don't remove this feature. The LSTM model needs a large number of threads (24) just to process a small PDF (6 pages). Currently, the legacy model is the best solution for this job.
Please poll us, the users, before removing this important feature.

By the way, what is the usage scenario for Tesseract 4 and 5 without the legacy model, if it turns out to be significantly more time-consuming in mass processing?

stweil (Contributor, Author) commented Nov 7, 2023

@Mark-Joy, that's only true if Tesseract is running with OpenMP multithreading enabled. I always suggest disabling multithreading because OpenMP wastes a lot of CPU resources. Then LSTM is still faster than legacy mode, although it uses only a single CPU thread per page. On our OCR servers, mass processing runs with 48 or 64 CPU threads, which process the same number of page images simultaneously. And of course they use our self-trained LSTM models, not legacy models.

Mark-Joy commented Nov 7, 2023

@stweil, thank you for the information.
With OpenMP disabled, may I ask which traineddata model can be used to make the LSTM model faster than the legacy one?

On my machine, only the "fast" traineddata model (oem=1) is 10-15% faster than legacy, but it gives unsatisfying text recognition results. In 6-7 cases out of 10 I would pick the legacy result whenever there's a difference.

With traineddata that supports both legacy (oem=0) and LSTM (oem=1), LSTM is 35% slower.

The worst are the "best" traineddata models (oem=1), where LSTM is 4-5 times (400-500%) slower than legacy.
