High CPU usage on multicore #61

Open

Description

I'm using Ubuntu 22.04.3, Nextcloud 27.0.2 and files_fulltextsearch_tesseract 27.0.0.
My machine has 4 cores (4 threads) available, but Tesseract OCR takes ages, and CPU usage sometimes stays at 100% for hours. Checking with htop, I noticed that tesseract runs with multiple (4) threads but never seems to finish ;).
I'm not sure when this issue started. Judging by #14, tesseract only used one thread in the past. I have been affected by this high CPU usage for at least a couple of months.

After a short search, I came across the option to set a thread limit (https://github.com/thiagoalessio/tesseract-ocr-for-php#thread-limit). There seem to be cases (like mine) in which tesseract blocks itself when it uses too many of the available cores (see also tesseract-ocr/tesseract#898).
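For anyone who wants to try this outside the app, here is a minimal sketch of capping the thread count by setting the OpenMP environment variable `OMP_THREAD_LIMIT` for the tesseract child process. As far as I can tell, this is the variable the linked `threadLimit()` option relies on, but treat that as an assumption. The function names `ocrEnvironment` and `runTesseract` are made up for illustration and are not part of any library.

```php
<?php
declare(strict_types=1);

/**
 * Build the environment for a tesseract child process with a capped
 * OpenMP thread count. Hypothetical helper, not part of any library.
 */
function ocrEnvironment(int $threadLimit): array
{
    if ($threadLimit < 1) {
        throw new InvalidArgumentException('threadLimit must be >= 1');
    }
    // Keep the current environment (PATH etc.); the cap on the left wins
    // if OMP_THREAD_LIMIT is already set.
    return ['OMP_THREAD_LIMIT' => (string) $threadLimit] + getenv();
}

/**
 * Run the tesseract CLI on one image and return the recognized text.
 * Assumes a `tesseract` binary is on the PATH (PHP >= 7.4 for array commands).
 */
function runTesseract(string $imagePath, int $threadLimit): string
{
    $pipes = [];
    $process = proc_open(
        ['tesseract', $imagePath, 'stdout'],
        [1 => ['pipe', 'w']],          // capture stdout only; stderr is inherited
        $pipes,
        null,
        ocrEnvironment($threadLimit)
    );
    if (!is_resource($process)) {
        throw new RuntimeException('could not start tesseract');
    }
    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);
    return $text === false ? '' : $text;
}

// Example call (needs tesseract installed and an existing image):
// echo runTesseract('page-1.png', 1);
```

Setting the variable per child process keeps the limit local to OCR and leaves the rest of the PHP environment untouched.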

Thus, I did some testing with this thread limit...
OCR Settings:
Limit PDF pages: 20
Timeout: 60 seconds

Testfile: Nextcloud Manual.pdf

I measured the runtime for this loop:

for ($i = 1; $i <= $pages; $i++) {
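A rough sketch of how such a measurement can be taken is shown here. The per-page OCR call is passed in as a callable, so the timing code runs on its own. `timePageLoop` and the dummy callback are hypothetical names for illustration, not the app's actual code.

```php
<?php
declare(strict_types=1);

/**
 * Time a 1-based page loop and return the elapsed wall-clock seconds.
 * $ocrPage stands in for whatever processes a single page.
 */
function timePageLoop(int $pages, callable $ocrPage): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds (PHP >= 7.3)
    for ($i = 1; $i <= $pages; $i++) {
        $ocrPage($i);
    }
    return (hrtime(true) - $start) / 1e9;
}

// Dummy callback that only records which pages were visited.
$seen = [];
$elapsed = timePageLoop(3, function (int $page) use (&$seen): void {
    $seen[] = $page;
});
printf("%d pages in %.2f seconds\n", count($seen), $elapsed);
```

Using the monotonic `hrtime()` rather than `microtime()` avoids skew from system clock adjustments during long runs.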

Runtime for all pages:
threadLimit(1): 27.84 seconds
threadLimit(2): 26.21 seconds
threadLimit(3): 20.00 seconds
threadLimit(4): more than 15 minutes (still running while I wrote this issue). EDIT: it took 1119.51 seconds (18.66 minutes) to finish.

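As you can see, limiting the number of threads improves the situation considerably: with threadLimit(3) the run finished in 20.00 seconds, roughly 56 times faster than the 1119.51 seconds it took with all four threads. Double-checking on a test instance with 2 cores shows the same pattern (tesseract blocks itself when running on all available cores).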
