Description
LocalAI version:
image: quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas
6:51AM INF Starting LocalAI using 32 threads, with models path: /build/models
6:51AM INF LocalAI version: v2.27.0 (6d7ac09)
Environment, CPU architecture, OS, and Version:
Linux Jarvis 6.14.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 10 Apr 2025 18:43:59 +0000 x86_64 GNU/Linux
(EndeavourOS)
Describe the bug
The log files show what happened, but in short: the bug is that the idle watchdog kills the backend process while the model is in the middle of its long-chain deep reasoning.
To Reproduce
Install the model (checked: it reproduces with the q4 version of nous_deephermes-3...gguf from the model database).
Enable deep-reasoning mode by placing the following in either the system prompt or at the beginning of the user's input prompt:
"You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem."
Then ask it a question and watch it begin reasoning. Depending on where you are in the watchdog's idle window, you won't have to wait long before the backend gets killed.
Expected behavior
I expected it NOT to kill the model while it was busy. I have the idle watchdog enabled because the llama.cpp backend doesn't clean up the context window between calls, and things quickly blow up with OOMs.
# Enables watchdog to kill backends that are inactive for too much time
LOCALAI_WATCHDOG_IDLE=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
LOCALAI_WATCHDOG_IDLE_TIMEOUT=1m
Logs
7:35AM INF BackendLoader starting backend=llama-cpp modelID=nousresearch_deephermes-3-mistral-24b-preview o.model=DeepHermes-3-Mistral-24B-Preview-q5.gguf
7:35AM INF Success ip=172.20.0.1 latency=14.301568308s method=POST status=200 url=/v1/chat/completions
7:36AM INF Success ip=127.0.0.1 latency="106.63µs" method=GET status=200 url=/readyz
7:36AM INF Success ip=172.20.0.1 latency=3.860199651s method=POST status=200 url=/v1/chat/completions
7:37AM INF Success ip=172.20.0.1 latency=2.863545717s method=POST status=200 url=/v1/chat/completions
7:38AM INF Success ip=172.20.0.1 latency=6.807149169s method=POST status=200 url=/v1/chat/completions
7:38AM INF Success ip=172.20.0.1 latency=2.027680206s method=POST status=200 url=/v1/chat/completions
7:39AM INF Success ip=127.0.0.1 latency="44.82µs" method=GET status=200 url=/readyz
7:40AM INF Success ip=172.20.0.1 latency=2.011001385s method=POST status=200 url=/v1/chat/completions
7:41AM INF Success ip=172.20.0.1 latency=8.503882328s method=POST status=200 url=/v1/chat/completions
7:42AM INF Success ip=127.0.0.1 latency="84.381µs" method=GET status=200 url=/readyz
7:43AM INF Success ip=172.20.0.1 latency=1.921952499s method=POST status=200 url=/v1/chat/completions
7:44AM INF Success ip=172.20.0.1 latency=9.305812665s method=POST status=200 url=/v1/chat/completions
7:45AM INF Success ip=127.0.0.1 latency="32.777µs" method=GET status=200 url=/readyz
7:47AM WRN [WatchDog] Address 127.0.0.1:46587 is idle for too long, killing it
Error rpc error: code = Unavailable desc = error reading from server: EOF
Error rpc error: code = Unavailable desc = error reading from server: EOF
Error rpc error: code = Unavailable desc = error reading from server: EOF
Additional context
Possibly relevant configuration:
#
# Enables watchdog to kill backends that are busy for too much time
LOCALAI_WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
LOCALAI_WATCHDOG_BUSY_TIMEOUT=25m
With the new deep-reasoning models, I'd rate this as a medium-priority issue (it's not a fatal error, and most other models work decently well on an in-home server lab)...
I see two possible avenues for a fix: change the backend to do better cleanup of the context window / overhead between calls, or fix the idle part of the watchdog to actually determine whether the model is in use in VRAM. Or both. :P
Anyway, I stumbled across this, and I kinda need it fixed if my grand plans of building a tech empire are ever to come to fruition.
Thank you for your time and attention to this matter!