-
Notifications
You must be signed in to change notification settings - Fork 12.1k
server: health: fix race condition on slots data using tasks queue #5634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: health: fix race condition on slots data using tasks queue #5634
Conversation
I think |
* include_slots only if slots_endpoint * fix compile warning task.target_id not initialized.
Thanks 👍 done! |
Hi guys, I believe this might have introduced an issue. |
It works on my end: ./server -m models/llama-7b-v2/ggml-model-q4_0.gguf $ ▶ curl http://localhost:8080/health -H "Content-Type: application/json" -H "Authorization: Bearer no-key"
{"slots_idle":1,"slots_processing":0,"status":"ok"}
$ ▶ curl http://localhost:8080/health -H "Content-Type: application/json" -H "Authorization: Bearer no-key"
{"slots_idle":1,"slots_processing":0,"status":"ok"}
$ ▶ curl http://localhost:8080/health -H "Content-Type: application/json" -H "Authorization: Bearer no-key"
{"slots_idle":1,"slots_processing":0,"status":"ok"} Can you provide repro using curl commands? |
I tried this:
Depending on the prompt, it fails for 10s, if it's a bigger prompt, it will fail for many many cycles. |
Aha, got it. If the prompt processing lasts more than 1s, the query will timeout because the server is busy processing:
We probably have to rethink the task queueing logic |
Yes, the amount of time /health hangs depend on the number of tokens of the query, it seems. |
I can confirm the bug: When the server is busy doing something, the queue is blocked by Before this PR, we don't have this behavior because What I'm thinking is to delegate access to slot data to mutex inside |
Hi guys, following up the issue of blocking |
#5882 should improve the situation. Very large prompts will now be chunked in |
…gml-org#5634) * server: health: fix race condition on slots data using tasks queue * server: health: * include_slots only if slots_endpoint * fix compile warning task.target_id not initialized.
…gml-org#5634) * server: health: fix race condition on slots data using tasks queue * server: health: * include_slots only if slots_endpoint * fix compile warning task.target_id not initialized.
Issue
Changes on
/health
and/slots
endpoints in PRs #5550 and #5548 incorrectly access the slots internal data. It can create a race condition under some circonstances identified in #5566.Changes
Use the
tasks_queue
to access slots data in health checks. As of now access to health are synchronized, it will be useful to also include slots detail in the/health
response body with the query parameterinclude_slots
if except if--slots-endpoint-disable
is set.