server: health: fix race condition on slots data using tasks queue #5634


Merged

Conversation

phymbert
Collaborator

@phymbert phymbert commented Feb 21, 2024

Issue

Changes to the /health and /slots endpoints in PRs #5550 and #5548 incorrectly access the slots' internal data. This can create a race condition under certain circumstances, as identified in #5566.

Changes

Use the tasks_queue to access slots data in health checks. Access to /health is now synchronized. It is also useful to include slot details in the /health response body via the include_slots query parameter, unless --slots-endpoint-disable is set.
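For illustration, a minimal sketch of the approach (the handler shape and type names loosely follow the server's httplib-based code, but are not verbatim from the diff): the /health handler posts a task into the queue and blocks until the main loop, the sole owner of the slot data, replies.

```cpp
// Hypothetical sketch: /health no longer reads slot data directly from the
// HTTP thread; it posts a task and waits for the main loop to answer.
svr.Get("/health", [&](const httplib::Request &, httplib::Response & res) {
    task_server task;
    task.id   = llama.queue_tasks.get_new_id();
    task.type = TASK_TYPE_SLOTS_DATA;           // handled on the main loop

    llama.queue_results.add_waiting_task_id(task.id);
    llama.queue_tasks.post(task);

    // blocks until the main loop has filled in slots_idle / slots_processing
    task_result result = llama.queue_results.recv(task.id);
    llama.queue_results.remove_waiting_task_id(task.id);

    res.set_content(result.result_json.dump(), "application/json");
});
```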

@phymbert phymbert requested a review from ggerganov February 21, 2024 13:02
@Artefact2
Collaborator

I think include_slots should be disabled if --slots-endpoint-disable is set, considering it's a trivial bypass otherwise.

    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.
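For illustration, a sketch of the gating those commits describe (flag and variable names are assumptions, not verbatim from the diff): the per-slot detail is only attached when the slots endpoint itself is enabled, so --slots-endpoint-disable cannot be bypassed via /health.

```cpp
// Hypothetical sketch of the include_slots gating.
json health = {
    { "status",           "ok"         },
    { "slots_idle",       n_idle       },
    { "slots_processing", n_processing },
};
if (req.has_param("include_slots") && sparams.slots_endpoint) {
    health["slots"] = slots_json;   // same per-slot detail as /slots
}
res.set_content(health.dump(), "application/json");
```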
@phymbert
Collaborator Author

--slots-endpoint-disable

Thanks 👍 done!

@phymbert phymbert requested a review from ngxson February 21, 2024 13:43
@phymbert phymbert merged commit 1ecea25 into ggml-org:master Feb 21, 2024
@phymbert phymbert deleted the hotfix/server-health-endpoint-slots-race branch February 21, 2024 15:16
@phymbert phymbert mentioned this pull request Feb 21, 2024
@fcecagno

Hi guys, I believe this might have introduced an issue.
I tried c71d608 from a few weeks ago and then the latest build yesterday. When I use the API, the /health response hangs, which is quite problematic since it is used to check the health of the application: if it hangs, the application is assumed to be unresponsive and may be killed/restarted. Have you tested this scenario? Is there anything specific I can provide to help get this fixed? The issue is not present on c71d608.

@ggerganov
Member

It works on my end:

./server -m models/llama-7b-v2/ggml-model-q4_0.gguf
$ ▶ curl http://localhost:8080/health -H "Content-Type: application/json" -H "Authorization: Bearer no-key"
{"slots_idle":1,"slots_processing":0,"status":"ok"}

$ ▶ curl http://localhost:8080/health -H "Content-Type: application/json" -H "Authorization: Bearer no-key"
{"slots_idle":1,"slots_processing":0,"status":"ok"}

$ ▶ curl http://localhost:8080/health -H "Content-Type: application/json" -H "Authorization: Bearer no-key"
{"slots_idle":1,"slots_processing":0,"status":"ok"}

Can you provide a repro using curl commands?

@fcecagno

I tried this:

while true; do date; curl --max-time 1 localhost:8080/health; echo "  -> $?"; sleep 1; done

Depending on the prompt, it fails for about 10s; with a bigger prompt, it fails for many more cycles.
If it works on your end, I'll try to provide full step-by-step instructions on how to reproduce it.

@ggerganov
Member

Aha, got it. If prompt processing lasts more than 1s, the query will time out because the server is busy processing:

{"slots_idle":1,"slots_processing":0,"status":"ok"}  -> 0
Wed Feb 28 16:01:33 EET 2024
{"slots_idle":1,"slots_processing":0,"status":"ok"}  -> 0
Wed Feb 28 16:01:34 EET 2024
curl: (28) Operation timed out after 1006 milliseconds with 0 bytes received
  -> 28
Wed Feb 28 16:01:36 EET 2024
{"slots_idle":0,"slots_processing":1,"status":"no slot available"}  -> 0
Wed Feb 28 16:01:37 EET 2024
{"slots_idle":0,"slots_processing":1,"status":"no slot available"}  -> 0
Wed Feb 28 16:01:38 EET 2024
{"slots_idle":0,"slots_processing":1,"status":"no slot available"}  -> 0
Wed Feb 28 16:01:39 EET 2024
{"slots_idle":0,"slots_processing":1,"status":"no slot available"}  -> 0
Wed Feb 28 16:01:40 EET 2024
{"slots_idle":0,"slots_processing":1,"status":"no slot available"}  -> 0
Wed Feb 28 16:01:41 EET 2024
{"slots_idle":1,"slots_processing":0,"status":"ok"}  -> 0
Wed Feb 28 16:01:42 EET 2024

We probably have to rethink the task queueing logic.
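The shape of the problem, as a simplified sketch (not the actual server loop; names are illustrative only): a single thread alternates between draining the task queue and decoding, so a llama_decode call that runs for seconds delays every queued /health task.

```cpp
// Simplified picture of the starvation.
while (running) {
    queue_tasks.process_pending();   // /health replies are produced here...
    update_slots();                  // ...but this calls llama_decode, which can
                                     // block for the whole prompt-processing time
}
```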

@fcecagno

Yes, the amount of time /health hangs seems to depend on the number of tokens in the query.

@ngxson
Collaborator

ngxson commented Feb 28, 2024

I can confirm the bug: when the server is busy doing something, the queue is blocked by llama_decode. During this time, incoming requests are queued instead of being processed right away.

Before this PR we did not have this behavior, because /health did not depend on the queue.

What I'm thinking is to guard access to the slot data with the mutex inside llama_server_queue (i.e., add new functions to llama_server_queue to get/set the slot data). That way, we would no longer need to push /health requests into the queue.
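A minimal sketch of that idea, with hypothetical member names: the counters live inside llama_server_queue behind their own mutex, updated by the main loop and read directly by the HTTP thread, so /health never waits on the task queue.

```cpp
#include <mutex>
#include <utility>

struct llama_server_queue {
    // ... existing queue members ...
    std::mutex mutex_slots;
    int        n_slots_idle       = 0;
    int        n_slots_processing = 0;

    // called by the main loop whenever slot states change
    void set_slots_data(int idle, int processing) {
        std::lock_guard<std::mutex> lock(mutex_slots);
        n_slots_idle       = idle;
        n_slots_processing = processing;
    }

    // called by the HTTP thread from the /health handler
    std::pair<int, int> get_slots_data() {
        std::lock_guard<std::mutex> lock(mutex_slots);
        return { n_slots_idle, n_slots_processing };
    }
};
```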

@fcecagno

fcecagno commented Mar 6, 2024

Hi guys, following up on the blocking /health issue: has an issue already been created for it?

@ggerganov
Member

#5882 should improve the situation. Very large prompts will now be processed in chunks of n_batch tokens, and after each chunk the server should respond to /health requests. Please give it a try and see if it helps, but keep in mind there could be some bugs in that branch; it has not been tested thoroughly yet.
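The gist of the chunking, as a simplified sketch (using the two-argument llama_batch_get_one from current llama.cpp; the actual change in #5882 is more involved): the prompt is decoded n_batch tokens at a time, and control returns to the main loop between chunks so pending /health tasks can be answered.

```cpp
// Simplified sketch: evaluate the prompt in n_batch-sized chunks instead of
// one monolithic llama_decode call.
for (int i = 0; i < n_prompt_tokens; i += n_batch) {
    const int n_eval = std::min(n_batch, n_prompt_tokens - i);
    if (llama_decode(ctx, llama_batch_get_one(&prompt_tokens[i], n_eval)) != 0) {
        break;  // decode failed
    }
    // between chunks the main loop can service queued tasks (e.g. /health)
    // before submitting the next chunk
}
```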

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
server: health: fix race condition on slots data using tasks queue (ggml-org#5634)

* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
server: health: fix race condition on slots data using tasks queue (ggml-org#5634)

* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.