Description
I'm using Ubuntu 22.04.3, Nextcloud 27.0.2 and files_fulltextsearch_tesseract 27.0.0.
My machine has 4 cores (4 threads) available, but tesseract OCR takes ages, and CPU usage sometimes stays at 100% for hours. Checking with htop, I noticed that tesseract runs with multiple (4) threads but never seems to finish ;).
I'm not sure when this issue started. Looking at #14, it seems tesseract used only one thread in the past. I have been affected by this high CPU usage for at least a couple of months.
After a short search, I came across the option to set a thread limit (https://github.com/thiagoalessio/tesseract-ocr-for-php#thread-limit). It seems there are cases (like mine) in which tesseract blocks itself when it runs on all available cores (see also tesseract-ocr/tesseract#898).
Thus, I did some testing with this thread limit...
OCR Settings:
Limit PDF pages: 20
Timeout: 60 seconds
Test file: Nextcloud Manual.pdf
I measured the runtime for this loop:
Runtime for all pages:
threadLimit(1): 27.84 seconds
threadLimit(2): 26.21 seconds
threadLimit(3): 20.00 seconds
threadLimit(4): more than 15 minutes (it was still running while I wrote this issue). EDIT: It took 1119.51 seconds (18.66 minutes) to finish.
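As you can see, limiting the number of threads improves the situation a lot: with a limit of 3 the run finished in 20.00 seconds, about 56 times faster than the 1119.51 seconds it took with all 4 threads. Double-checking on a test instance with 2 cores available shows the same behavior (tesseract blocks itself when running on all available cores).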
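For anyone who wants to check this outside of Nextcloud, here is a rough sketch of how the comparison could be repeated directly against the tesseract binary. It is not the code used for the numbers above. It assumes that the thread limit boils down to setting the OpenMP environment variable OMP_THREAD_LIMIT for the tesseract process (which is what tesseract-ocr/tesseract#898 discusses), and the helper names and the file name page-01.png are made up for illustration.

```python
import os
import subprocess
import time


def env_with_thread_limit(limit):
    """Return a copy of the current environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ)
    # Assumption: tesseract's OpenMP runtime honours this variable.
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def time_command(cmd, limit, timeout=None):
    """Run cmd with the given thread limit and return the elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        env=env_with_thread_limit(limit),
        check=True,                 # raise if the command fails
        timeout=timeout,            # None means wait until it finishes
        stdout=subprocess.DEVNULL,  # discard the recognised text
    )
    return time.perf_counter() - start


def benchmark(image_path, limits=(1, 2, 3, 4)):
    """Time one tesseract run per thread limit and print the results."""
    for limit in limits:
        # "stdout" as output base tells tesseract to write text to stdout.
        elapsed = time_command(["tesseract", image_path, "stdout"], limit)
        print(f"OMP_THREAD_LIMIT={limit}: {elapsed:.2f} seconds")
```

Calling benchmark("page-01.png") with a page image exported from the test PDF should print one line per limit. Since the unlimited case can run for many minutes, passing a timeout to time_command may be useful when testing.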