server : refactor middleware and /health endpoint #9056
Merged: ngxson merged 6 commits into ggerganov:master from ngxson:wsn/server_health_non_blocking on Aug 16, 2024
ggerganov approved these changes on Aug 16, 2024.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I started a discussion thread related to this issue; please take a look: #9276
ngxson added the breaking change label (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility.) on Sep 2, 2024.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024:
* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit with the same message to arthw/llama.cpp that referenced this pull request on Nov 18, 2024.
Labels:
- breaking change: Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility.
- examples
- python: python script changes
- server
/health endpoint
In the beginning, the /health endpoint was used to retrieve the slot state. At the time, the /completions endpoint returned an error if no slot was available, so applications used /health to wait until a slot became free.
Nowadays, the server can queue (defer) a request when no slot is available.
/health is also used by Docker for health checking. This has now become a problem: when the server is busy with a long task, /health can time out. On HF Inference Endpoints, this puts the container into an unhealthy state, which triggers a forced restart.
Therefore, I propose a cleaner usage:
- GET /health is now used purely to report actual health.
- GET /slots can be used as a replacement to get the slot state.
As a consequence, /health?fail_on_no_slot=1 is moved to /slots?fail_on_no_slot=1 (we keep this option for compatibility).
Refactor middleware
Some repeated code blocks, such as setting the Access-Control-Allow-Origin header, are now moved into middleware.
The middleware is now also responsible for returning an error if the server is not yet ready: when the server starts and the model is still being loaded, a request to any endpoint results in a 503 error code.
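The request flow described above can be modeled as a small dispatcher. This is a minimal, framework-free Python sketch, not the C++ code in examples/server/server.cpp: the names ServerState, Slot, Response, route and handle_request, the JSON bodies, and the 503 answer for /slots?fail_on_no_slot=1 when every slot is busy are assumptions made for illustration. It shows the middleware setting Access-Control-Allow-Origin once for every response and rejecting every endpoint with 503 while the model loads, /health reporting only liveness, and slot inspection living under /slots.

```python
from dataclasses import dataclass, field
from enum import Enum
from urllib.parse import parse_qs, urlsplit


class ServerState(Enum):
    LOADING_MODEL = 1  # model still loading: the middleware rejects every endpoint
    READY = 2


@dataclass
class Slot:
    id: int
    is_processing: bool


@dataclass
class Response:
    status: int
    body: dict
    headers: dict = field(default_factory=dict)


def route(path: str, query: dict, slots: list) -> Response:
    """Hypothetical endpoint handlers, reached only after the middleware lets a request through."""
    if path == "/health":
        # Purely reports health: no slot inspection, so a busy server still answers 200.
        return Response(200, {"status": "ok"})
    if path == "/slots":
        fail_on_no_slot = query.get("fail_on_no_slot", ["0"])[0] == "1"
        if fail_on_no_slot and all(s.is_processing for s in slots):
            # Assumed behavior: report "unavailable" when no slot is idle.
            return Response(503, {"error": "no slot available"})
        return Response(200, {"slots": [{"id": s.id, "is_processing": s.is_processing} for s in slots]})
    return Response(404, {"error": "not found"})


def handle_request(url: str, state: ServerState, slots: list) -> Response:
    """Hypothetical middleware: shared logic applied to every request."""
    parts = urlsplit(url)
    if state is ServerState.LOADING_MODEL:
        res = Response(503, {"error": "Loading model"})
    else:
        res = route(parts.path, parse_qs(parts.query), slots)
    # Set once here instead of repeating it in every handler.
    res.headers["Access-Control-Allow-Origin"] = "*"
    return res
```

With two busy slots, handle_request("/health", ServerState.READY, busy) still returns 200, while handle_request("/slots?fail_on_no_slot=1", ServerState.READY, busy) returns 503, which is the split between liveness and slot availability that the PR proposes.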
Behavior when loading the model fails
If the model fails to load (for example, the file does not exist), the server simply exits with status code 1. This resolves #7787, where a user reported that loading an invalid model caused the server to crash.
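The start-up ordering can be sketched the same way. This is a hedged Python model, not the server's C++ code: start_server, the load_model callable, the set_state callback, and signaling a failed load by raising an exception are all hypothetical. It shows the state being LOADING_MODEL while loading (the window in which the middleware answers 503), and a failed load ending the process with exit status 1 instead of leaving a broken server behind.

```python
import sys
from enum import Enum


class ServerState(Enum):
    LOADING_MODEL = 1
    READY = 2


def start_server(model_path: str, load_model, set_state) -> None:
    """Hypothetical start-up sequence; load_model is any callable that raises on failure."""
    # The HTTP listener is assumed to be up already; requests get 503 during this phase.
    set_state(ServerState.LOADING_MODEL)
    try:
        load_model(model_path)
    except Exception as exc:  # e.g. FileNotFoundError for a missing model file
        print(f"failed to load model '{model_path}': {exc}", file=sys.stderr)
        sys.exit(1)  # clean exit with status 1 rather than a crash
    set_state(ServerState.READY)
```

A supervisor such as Docker or a test harness can then treat exit status 1 as "the model could not be loaded" rather than waiting on a health check that will never pass.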