Closed
Labels
kind/bug: Categorizes issue or PR as related to a bug.
Description
While we depend on upstream model servers to support graceful drain (entering a mode in which the server exits once all in-flight requests have completed, usually with a timeout, though not always for servers handling very long-running requests), our examples and docs should clearly document graceful drain and configure the pool members for it.
I.e. the classic:
- Use a preStop hook to wait for load balancers to stop sending traffic (depends on the config of the fronting LB)
- Respond to SIGTERM in the model server process (e.g. vLLM) to begin draining and exit when completed
- Optionally let the drain be unbounded for extremely long requests or for cases where the LB has an extremely long drain period
- Write good log messages
- Ensure the readiness probe continues to succeed for as long as the model server is accepting requests (so the service keeps routing traffic to it while it can still take requests)
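The server-side part of the list above can be sketched in Python. This is a minimal illustration only, not vLLM's implementation: `DrainController` and all of its methods are hypothetical names, and the sketch assumes the server stops accepting new requests once SIGTERM arrives and then waits for in-flight requests, optionally with a timeout.

```python
import logging
import threading
import time

log = logging.getLogger("drain")


class DrainController:
    """Hypothetical helper: tracks in-flight requests and a draining flag."""

    def __init__(self, drain_timeout_s=None):
        # None means an unbounded drain (wait for every request to finish).
        self.drain_timeout_s = drain_timeout_s
        self._lock = threading.Lock()
        self._in_flight = 0
        self.draining = False

    def handle_sigterm(self, signum=None, frame=None):
        # Only flip a flag here; locking or logging in a signal handler is risky.
        self.draining = True

    def is_ready(self):
        # Readiness stays true for as long as new requests are accepted.
        return not self.draining

    def try_start_request(self):
        # Returns False once draining has begun, so the caller can reject.
        with self._lock:
            if self.draining:
                return False
            self._in_flight += 1
            return True

    def finish_request(self):
        with self._lock:
            self._in_flight -= 1

    def wait_until_drained(self, poll_s=0.05):
        """Return True once no requests remain, False if the timeout expires."""
        start = time.monotonic()
        while True:
            with self._lock:
                remaining = self._in_flight
            waited = time.monotonic() - start
            if remaining == 0:
                log.info("drain complete after %.1fs", waited)
                return True
            if self.drain_timeout_s is not None and waited >= self.drain_timeout_s:
                log.warning("drain timed out with %d request(s) in flight", remaining)
                return False
            time.sleep(poll_s)
```

A process would typically register the handler with `signal.signal(signal.SIGTERM, controller.handle_sigterm)`, serve its readiness endpoint from `is_ready()`, and call `wait_until_drained()` before exiting. Passing `drain_timeout_s=None` gives the unbounded drain mentioned in the list.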
We should work with upstream vLLM to ensure it shuts down gracefully and that the out-of-the-box examples show this.
EDIT: vLLM does support drain on SIGTERM:
INFO 03-20 14:21:01 [launcher.py:74] Shutting down FastAPI HTTP server.
INFO: Shutting down
INFO: Waiting for connections to close. (CTRL+C to force quit)
So we are missing preStop in our examples (will test).
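As a rough sketch of what the missing preStop piece might look like, the Python below builds the relevant pod spec fields as a dict and prints them as JSON, to be merged into an existing manifest. The field names (`terminationGracePeriodSeconds`, `lifecycle.preStop.exec.command`) are standard Kubernetes fields. The function name, container name, sleep duration, and drain budget are assumptions chosen for illustration and depend on the fronting LB's configuration.

```python
import json


def drain_pod_settings(prestop_sleep_s=30, drain_budget_s=120):
    """Return pod spec fields for graceful drain (all values are assumptions)."""
    return {
        # The grace period clock also covers the preStop hook, so budget for
        # both the LB wait and the model server's own drain.
        "terminationGracePeriodSeconds": prestop_sleep_s + drain_budget_s,
        "containers": [
            {
                "name": "model-server",  # hypothetical container name
                "lifecycle": {
                    "preStop": {
                        # Assumes the image ships /bin/sh and sleep. SIGTERM is
                        # only sent to the container after this hook returns.
                        "exec": {
                            "command": ["/bin/sh", "-c", f"sleep {prestop_sleep_s}"]
                        }
                    }
                },
            }
        ],
    }


print(json.dumps(drain_pod_settings(), indent=2))
```

The sleep gives load balancers time to stop sending traffic before the model server sees SIGTERM. The grace period is the sum of the two budgets because a pod that exceeds it is killed regardless of whether the drain has finished.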