feat: connection draining on stop CLI command #4443

Open

kaiburjack wants to merge 1 commit into varnishcache:master from HBTGmbH:stop_drain

Conversation

Contributor

@kaiburjack commented Feb 5, 2026

I've added "connection draining" support to the stop CLI command, as proposed in this comment.
The stop CLI command now takes an optional parameter -t with an optional argument <duration>. When -t is given, stopping begins with a "drain period" of at most shutdown_timeout, during which no idle connections are closed, but every response carries an additional Connection: close header so that clients leave the connection on their own.
If a <duration> argument is supplied to -t, that duration is used as the effective shutdown timeout instead of the shutdown_timeout varnishd parameter.
The stop CLI command also no longer blocks.
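
To make the behaviour concrete, here is a rough, self-contained sketch of the drain semantics described above. All names (start_drain, drain_deadline, tag_response) are invented for illustration; this is not the PR's actual code, which lives in varnishd's CLI and delivery paths.

```c
/*
 * Illustrative sketch only; names are invented and this is not the
 * PR's actual implementation.
 */
#include <math.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

static double shutdown_timeout_param = 60.;	/* the varnishd parameter */
static double drain_deadline = 0.;		/* 0 = not draining */

static double
mono_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (ts.tv_sec + ts.tv_nsec * 1e-9);
}

/* "stop -t" / "stop -t <duration>": pick the effective timeout. */
static void
start_drain(double duration_arg)	/* NAN when -t has no argument */
{
	double t = isnan(duration_arg) ? shutdown_timeout_param : duration_arg;

	drain_deadline = mono_now() + t;
}

/*
 * Called while response headers are assembled: during an active drain
 * period, append "Connection: close" and tell the caller to close the
 * connection once this response has been delivered.
 */
static bool
tag_response(char *hdrs, size_t len)
{
	if (drain_deadline == 0. || mono_now() >= drain_deadline)
		return (false);
	strncat(hdrs, "Connection: close\r\n", len - strlen(hdrs) - 1);
	return (true);
}

int
main(void)
{
	char hdrs[256] = "Content-Length: 0\r\n";

	start_drain(NAN);		/* like "stop -t" without a duration */
	return (tag_response(hdrs, sizeof hdrs) ? 0 : 1);
}
```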

Now, in order to do all this, we need an additional thread that monitors drain progress and an additional gauge tracking how many active client (downstream) connections are still open (both sketched below).
We also need to wake up idle workers (sitting out their 60s sleep) so that they release their VCL references quickly. This has also been a pending issue since Varnish 8.0.0: on an initiated shutdown, users get Child (PID) said shutdown waiting for N references on boot messages for up to a minute if workers stay idle during the shutdown because no new requests come in.
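
Roughly, the two pieces fit together like this. The sketch below is self-contained and uses invented names (n_active_conns, drain_monitor, pool_cond); it is not the actual implementation, only the shape of it: a gauge counts open client connections, a monitor thread polls it until it reaches zero or the deadline passes, and the idle-worker condition variable is broadcast so workers stop sleeping and can drop their VCL references.

```c
/*
 * Self-contained sketch only; names (n_active_conns, drain_monitor,
 * pool_cond, ...) are invented and this is not the PR's actual code.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_int n_active_conns;	/* gauge: open client connections */
static pthread_mutex_t pool_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_cond = PTHREAD_COND_INITIALIZER;
static int shutting_down;

/* Idle worker: normally sleeps for up to 60s between tasks. */
static void *
worker(void *priv)
{
	(void)priv;
	pthread_mutex_lock(&pool_mtx);
	while (!shutting_down) {
		struct timespec ts;

		clock_gettime(CLOCK_REALTIME, &ts);
		ts.tv_sec += 60;
		pthread_cond_timedwait(&pool_cond, &pool_mtx, &ts);
	}
	pthread_mutex_unlock(&pool_mtx);
	/* ...release the VCL reference here... */
	return (NULL);
}

/* Monitor: wait until all connections drained or the deadline passes. */
static void *
drain_monitor(void *priv)
{
	int deadline = *(int *)priv;	/* whole seconds, for simplicity */

	while (deadline-- > 0 && atomic_load(&n_active_conns) > 0)
		sleep(1);

	/* Wake idle workers so they release their references now. */
	pthread_mutex_lock(&pool_mtx);
	shutting_down = 1;
	pthread_cond_broadcast(&pool_cond);
	pthread_mutex_unlock(&pool_mtx);
	return (NULL);
}

int
main(void)
{
	pthread_t w, m;
	int timeout = 5;

	atomic_store(&n_active_conns, 0);
	pthread_create(&w, NULL, worker, NULL);
	pthread_create(&m, NULL, drain_monitor, &timeout);
	pthread_join(m, NULL);
	pthread_join(w, NULL);
	printf("drained, child can exit\n");
	return (0);
}
```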

I've tried to implement this as minimally as possible and without breaking existing tests (I've run the test suite locally on Linux Debian/Mint x64), and I've added additional tests for the drain feature.
There were also a few corner cases of assertions failing because the manager now does not reap the client immediately; instead, the client reaps itself.

Now, for my specific use case, Kubernetes: when deploying this as a varnishd container process in a pod, actually making use of the draining requires a preStop exec hook that issues varnishadm stop -t 30s and then repeatedly polls varnishadm status to see when the client has exited, so that varnishd itself can receive SIGTERM from the kubelet and the container and pod can exit.

@kaiburjack
Contributor Author

I think this Ubuntu Noble test failure is something else. This particular job fails on basically every recent MR, including ones where it is the only failed job:
