High CPU usage on multicore

I'm using Ubuntu 22.04.3, Nextcloud 27.0.2 and files_fulltextsearch_tesseract 27.0.0.
My machine has 4 cores (4 threads) available, but Tesseract OCR takes ages, and CPU usage sometimes stays at 100% for hours. Checking with htop, I noticed that tesseract runs with multiple (4) threads but never seems to finish ;).
I'm not sure when this issue started. Judging by https://github.com/nextcloud/files_fulltextsearch_tesseract/issues/14, tesseract only used one thread in the past. I have been affected by this high CPU usage for at least a couple of months.

After a short search, I came across the option to set a thread limit (https://github.com/thiagoalessio/tesseract-ocr-for-php#thread-limit). It seems there are cases (like mine) in which tesseract blocks itself when too many cores are available (see also https://github.com/tesseract-ocr/tesseract/issues/898).
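To try the thread limit outside of Nextcloud, here is a minimal Python sketch. It assumes the limit is applied through the standard OpenMP environment variable `OMP_THREAD_LIMIT`, which to my understanding is what the PHP wrapper's thread limit option sets. The helper name `run_with_thread_limit` and the file `page.png` are hypothetical; the demo uses a stand-in command that only echoes the variable, so it runs without tesseract installed.

```python
import os
import subprocess
import sys
import time


def run_with_thread_limit(cmd, thread_limit):
    """Run cmd with OMP_THREAD_LIMIT set; return (elapsed_seconds, result)."""
    # Copy the current environment and cap the number of OpenMP threads.
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)

    start = time.perf_counter()
    result = subprocess.run(cmd, env=env, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, result


if __name__ == "__main__":
    # Stand-in command that prints the variable it received. For a real test,
    # swap in something like ["tesseract", "page.png", "stdout"]
    # (page.png is a hypothetical input file).
    probe = [sys.executable, "-c",
             "import os; print(os.environ['OMP_THREAD_LIMIT'])"]
    for limit in (1, 2, 3):
        elapsed, result = run_with_thread_limit(probe, limit)
        print(f"OMP_THREAD_LIMIT={limit}: {elapsed:.2f}s, "
              f"child saw {result.stdout.strip()}")
```

Setting the variable per process keeps the limit local to the OCR call instead of changing it for the whole system.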

So I did some testing with this thread limit:
**OCR Settings:**
**Limit PDF pages:** 20
**Timeout:** 60 seconds

**Test file:** Nextcloud Manual.pdf

I measured the runtime for this loop:
https://github.com/nextcloud/files_fulltextsearch_tesseract/blob/e1405e493ecdd40718cf28b54dd9ec867fc8b606/lib/Service/TesseractService.php#L252

**Runtime for all pages:**
threadLimit(1): 27.84 seconds
threadLimit(2): 26.21 seconds
threadLimit(3): 20.00 seconds
threadLimit(4): more than 15 minutes (still running while I was writing this issue). **EDIT:** It took 1119.51 seconds (18.66 minutes) to finish.

As you can see, limiting the number of threads improves the situation a lot. A second check on a test instance with 2 available cores shows the same result: tesseract blocks itself when running on all available cores.
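To put the measurements in proportion, here is a small Python sketch that compares each run against the fastest one. The numbers are the runtimes reported above for my 4-core machine; the variable and function names are only illustrative.

```python
# Measured runtimes in seconds, keyed by thread limit (values from this issue).
measured_seconds = {1: 27.84, 2: 26.21, 3: 20.00, 4: 1119.51}

# The thread limit with the shortest runtime serves as the baseline.
fastest_limit = min(measured_seconds, key=measured_seconds.get)
fastest_seconds = measured_seconds[fastest_limit]


def slowdown(limit):
    """Return how many times slower a run was than the fastest run."""
    return measured_seconds[limit] / fastest_seconds


for limit in sorted(measured_seconds):
    print(f"threadLimit({limit}): {measured_seconds[limit]:8.2f} s, "
          f"{slowdown(limit):.1f}x the fastest run")
```

On these numbers, the run on all 4 cores takes roughly 56 times as long as the run limited to 3 threads, while limits 1 to 3 stay within a factor of about 1.4 of each other.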

High CPU usage on multicore #61

XueSheng-GIT opened on Aug 23, 2023

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

High CPU usage on multicore #61

Description

XueSheng-GITopenedon Aug 23, 2023

Metadata