
server: enhanced health endpoint #5548


Merged
merged 2 commits into ggml-org:master from feature/server-better-healthcheck on Feb 18, 2024

Conversation

@phymbert (Collaborator) commented Feb 17, 2024

Context
It can be useful to monitor the server's slot activity, especially when no slot is available. In a cluster of llama.cpp servers, this allows routing incoming requests to an instance with available slots, for example when using Kubernetes probes.

Proposed changes
Add slots_idle and slots_processing fields to the health endpoint response, and answer 503 if no slot is available.

Closes #4746

@ggerganov merged commit e75c627 into ggml-org:master on Feb 18, 2024
{"slots_idle", available_slots},
{"slots_processing", processing_slots}};
res.set_content(health.dump(), "application/json");
res.status = 503; // HTTP Service Unavailable
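
For context, the complementary branch when at least one slot is free presumably mirrors this one with a 200; the following is a reconstructed sketch, not a verbatim quote of the diff, and the "ok" status string is an assumption:

// Reconstructed sketch of the branch taken when a slot is idle.
json health = {{"status",           "ok"},
               {"slots_idle",       available_slots},
               {"slots_processing", processing_slots}};
res.set_content(health.dump(), "application/json");
res.status = 200; // HTTP OK: at least one slot is idle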

@brittlewis12 commented:
@phymbert thanks for introducing this additional metadata to the health check!

one nit: it seems unidiomatic for health to return an error status code for an expected and error-free state. in practice, for a local inference server with a single slot (the default behavior), this is particularly unintuitive.

while the server is busy wrt inference, it can happily process health check requests — why return an error (5xx) status code, rather than a success (request understood and processed just fine) along with the actual information desired, the count of available slots (0)?

503 or 409 conflict make more sense to me for /completion or chat completion requests — their request can genuinely not be processed. but the health check returning 5xx codes during normal operation feels wrong to me. the server is not unhealthy by any metric.

it seems this is not an uncommon point of bikeshedding, so I will happily work around this behavior if I'm in the minority, but wanted to share in case others feel the same way.

happy to put up a patch if so!
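
to make it concrete, here is a rough sketch of the shape i have in mind (illustrative only, not a concrete patch; the variable names reuse the ones from the hunk above):

// sketch: always answer 200 and surface slot availability in the body instead
json health = {{"status",           available_slots > 0 ? "ok" : "no slot available"},
               {"slots_idle",       available_slots},
               {"slots_processing", processing_slots}};
res.set_content(health.dump(), "application/json");
res.status = 200; // the health request itself was processed just fine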

@phymbert (Collaborator, Author) commented:

hi @brittlewis12, thanks for your feedback.

My primary goal is to point a Kubernetes readiness probe at the health endpoint. This way, the server will not receive new incoming requests; they will be routed to another available pod instead. It does not mean the server is down but, as 503 says, that it is overloaded. This is standard for cloud-native applications.
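
For illustration, here is the kind of check a readiness probe effectively performs against the endpoint, written as a small cpp-httplib client (a sketch only; the host and port are assumptions for the example, not part of this PR):

// Sketch: a readiness-style check against /health.
// "localhost" and 8080 are example values, not part of this PR.
#include "httplib.h"

static bool server_has_free_slot() {
    httplib::Client cli("localhost", 8080);
    auto res = cli.Get("/health");
    // 200: at least one slot is idle, keep routing traffic to this pod.
    // 503: all slots are busy, the orchestrator sends requests elsewhere.
    return res && res->status == 200;
}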

@phymbert (Collaborator, Author) commented:

@brittlewis12 I finally got your point; PR #5594 addresses it. Thanks for pointing this out.

@phymbert deleted the feature/server-better-healthcheck branch on February 18, 2024 at 18:40
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* server: enrich health endpoint with available slots, return 503 if no slots are available

* server: document new status no slot available in the README.md
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server: enrich health endpoint with available slots, return 503 if no slots are available

* server: document new status no slot available in the README.md
Development

Successfully merging this pull request may close these issues: Healthcheck endpoint? (#4746)

3 participants