
Support forcing the proxy to accept new connections while max-sigterm-delay is in effect #1640

Closed
red8888 opened this issue Feb 8, 2023 · 14 comments · Fixed by #1805, #2266 or #2262

red8888 commented Feb 8, 2023

Problem Case

Consider the following scenario when running the proxy as a sidecar in k8s:

  1. term_timeout or max-sigterm-delay is set for the proxy container to 5 minutes
  2. Pod scales down
  3. The app container AND the proxy get a sigterm (because they are in the same pod)
  4. App's cleanup routine requires it to make new connections to the DB after getting a sigterm
  5. The proxy will refuse new connections after getting a sigterm and will shut down early if there are no open connections (which, for this app, will occasionally be the case)
  6. This race condition shuts down the proxy before the app can access the DB and causes the app to fail to finish its sigterm cleanup routine

It doesn't matter what you set the term_timeout to, because the app needs to initiate new connections after both it and the proxy get a sigterm.

Alternatives Considered

A hacky script entrypoint command that swallows the sigterm and forces the proxy to remain up for a static amount of time, to let the app container call the new quitquitquit endpoint.

Proposed Solution

What I suggest is this: the new quitquitquit endpoint supports an idle_timeout setting in the request. So the process could work like this:

  1. term_timeout or max-sigterm-delay is set for the proxy process to 5 minutes
  2. Pod scales down
  3. The app container AND the proxy get a sigterm (because they are in the same pod)
  4. App's cleanup routine immediately hits the quitquitquit endpoint of the proxy and sets idle_timeout to 30 seconds
  5. term_timeout is still in effect but the proxy will allow new connections for 30 seconds
  6. The app is now able to make a new DB connection to finish its cleanup routine
  7. After 30 secs there are no new connections and the proxy shuts down early
  8. Yay

Would it be easy to implement an arbitrary user-set timer in the quitquitquit method?
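
For illustration, a minimal sketch of what the app's cleanup routine might send under this proposal. The idle_timeout parameter is the proposed addition and does not exist today; the form encoding and the admin port (9091 by default) are assumptions:

```go
package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical: idle_timeout is the *proposed* parameter, not an
	// existing one; the form encoding and port are assumptions.
	form := url.Values{"idle_timeout": {"30s"}}
	resp, err := http.Post(
		"http://localhost:9091/quitquitquit",
		"application/x-www-form-urlencoded",
		strings.NewReader(form.Encode()),
	)
	if err != nil {
		log.Fatalf("quitquitquit request failed: %v", err)
	}
	resp.Body.Close()
}
```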

@red8888 red8888 added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Feb 8, 2023
@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Feb 8, 2023

enocom commented Feb 9, 2023

I'm curious about this:

> App's cleanup routine requires it to make new connections to the DB after getting a sigterm

Why is your app making a new database connection? Could you just re-use an existing connection with a connection pool for instance?

@GuillaumeCisco

Python Celery workers open and close new connections when processing jobs.
So setting max-sigterm-delay does not help here.

Would it be possible to keep the cloudsql proxy alive until the main container dies?
I've seen the trap thing on sigterm, but is there a less hacky way?

Maybe I'm missing something about configuring the way celery workers connect to cloudsql proxy with a connection pool. But it does not seem intuitive...


enocom commented Apr 26, 2023

Have you tried using the /quitquitquit endpoint? That's intended specifically for the case where you want to control the shutdown outside of K8s's usual lifecycle.
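
For reference, a minimal sketch of calling that endpoint from the app container, assuming the proxy was started with --quitquitquit and its admin server is listening on the default port 9091:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Tell the sidecar proxy to shut down once the app's cleanup is done.
	resp, err := http.Post("http://localhost:9091/quitquitquit", "", nil)
	if err != nil {
		log.Printf("failed to reach the proxy's admin server: %v", err)
		return
	}
	resp.Body.Close()
}
```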

@JamesMcNee

Hey @enocom -- Also encountering this as a change in behaviour between v1 and v2.

When using term_timeout in V1, new connections could still be established as long as there was an active connection. With V2 and max-sigterm-delay, though, new connections are rejected even when an existing connection is active.

The main thing I can see this impacting is apps that are using connection pools as they will not be able to lease new connections, which they would have previously.


GuillaumeCisco commented Apr 27, 2023

> Have you tried using the /quitquitquit endpoint? That's intended specifically for the case where you want to control the shutdown outside of K8s's usual lifecycle.

I don't want to try to kill the cloudsql proxy from another container right now, but maybe that would be a solution to explore.
After a lot of errors and retries, I succeeded in setting up pgbouncer (which is a connection pooler) on kubernetes as a sidecar for my celery workers.
I even created a repository to show how to do this: https://github.com/GuillaumeCisco/testpgbouncer

So I now have cloudsql proxy <- pgbouncer <- celery worker.
cloudsql proxy has max-sigterm-delay set to 30s, the same as the default graceful period.

Now what happens is that all containers receive a SIGTERM.
First, pgbouncer is killed: for pgbouncer, SIGTERM does a hard kill, unlike SIGINT, which does a soft kill (it issues PAUSE, accepting no more connections and waiting for active connections to close). Either way, no new connections are accepted even if some are still active.
It looks like only session pooling could do the trick, but not the other ones.
See the documentation: https://www.pgbouncer.org/usage.html

> Session pooling
> Most polite method. When a client connects, a server connection will be assigned to it for the whole duration the client stays connected. When the client disconnects, the server connection will be put back into the pool. This is the default method.
>
> Transaction pooling
> A server connection is assigned to a client only during a transaction. When PgBouncer notices that transaction is over, the server connection will be put back into the pool.
>
> Statement pooling
> Most aggressive method. The server connection will be put back into the pool immediately after a query completes. Multi-statement transactions are disallowed in this mode as they would break.
>
> SIGINT
> Safe shutdown. Same as issuing PAUSE and SHUTDOWN on the console.
>
> SIGTERM
> Immediate shutdown. Same as issuing SHUTDOWN on the console.
>
> PAUSE [db]
> PgBouncer tries to disconnect from all servers, first waiting for all queries to complete. The command will not return before all queries are finished. To be used at the time of database restart.

Then cloudsql-proxy is closed, and my celery worker throws errors, as it is still in its graceful shutdown and needs to access the database to store its last results before quitting.
That bumps me back to the first issue again.

What is needed is for the "connection pool / cloudsql proxy" containers to be shut down after the graceful period of the main container (the celery worker in my case). This means potentially accepting new connections to the connection pool.
So I don't really understand how a connection pool can help here. Am I missing a fundamental piece of understanding?

So maybe using the trap SIGTERM thing could help here for the connection pool part, but pgbouncer does not implement a quitquitquit endpoint I could use to send a new SIGTERM/SIGKILL when my worker finishes its graceful shutdown. So I'm still a bit confused about how to handle this issue.

Maybe simply use the trap SIGTERM solution with a sleep of 30s? That is not optimal.
Or use a preStop hook with a SIGINT and a sleep?

Thoughts?


enocom commented Apr 27, 2023

In theory, we could change the behavior to accept new connections when the max-sigterm-delay is in effect.

I think it would amount to moving these lines down below the timer.
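
Roughly, a sketch of that change (not the actual proxy source; the names and structure here are assumptions for illustration):

```go
package shutdown

import (
	"net"
	"time"
)

// onSIGTERM sketches the proposed behavior: instead of closing the
// listeners as soon as SIGTERM arrives, keep accepting new connections
// until max-sigterm-delay elapses. Simplified — the real proxy would
// also exit early once all open connections have drained.
func onSIGTERM(listeners []net.Listener, maxDelay time.Duration) {
	// Old behavior: the listeners were closed here, refusing new
	// connections for the entire delay window.
	time.Sleep(maxDelay) // keep serving during the delay
	for _, l := range listeners {
		l.Close() // only now stop accepting new connections
	}
}
```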

@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Apr 27, 2023
@plaflamme

Please see this post on the topic, which refers to a Twitter thread from Tim Hockin that basically says that (at least within k8s) a pod/container should always keep accepting new connections, even after a SIGTERM.

I personally found this very counterintuitive, and that thread clarified things quite well: basically, shutting down gracefully is hard in k8s, so you can't really assume anything.

As for this particular container, since it can run in varying contexts, perhaps this behaviour should be a flag, e.g.: --tim-says-keep-serving


enocom commented May 19, 2023

Thanks for the links @plaflamme. I'm convinced we can just fix this.

@enocom enocom added priority: p0 Highest priority. Critical issue. P0 implies highest priority. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels May 19, 2023
@enocom enocom changed the title Support forcing the proxy to accept new connections while term_timeout or max-sigterm-delay is in effect Support forcing the proxy to accept new connections while max-sigterm-delay is in effect May 22, 2023
enocom added a commit that referenced this issue May 22, 2023
When the proxy starts its shutdown, it will now accept additional
connections when the max-sigterm-delay flag is passed. As before, if any
connections are still open after the delay has elapsed, the Proxy will
report an error.

Fixes #1640.
enocom added a commit that referenced this issue May 23, 2023
@tcrasset

I'm having the same issues as OP, and I don't think the main concern has been addressed. At least, this is not what I'm seeing on my end. Using version 2.8.1.

I'm also using celery in a container, but not pooling connections, so I'm opening a new connection on every call to the database/proxy. I need to do some cleanup during my celery shutdown process that calls the database.

What happens is that both the celery container and the proxy container receive a SIGTERM.

  • celery container and proxy container receive a SIGTERM
  • the proxy, with --max-sigterm-delay 20s, is supposed to wait for 20 seconds, but as there are no connections, it terminates early.
  • celery performs the shutdown procedure, tries to call the database, and errors out.

Note that I only call /quitquitquit after the celery container exits.

What I need is the proxy to keep serving for a certain period of time, and not exit early if there are no connections, which I think is what @plaflamme was suggesting.


enocom commented Jan 12, 2024

I'm open to having the proxy wait the full duration regardless of how many open connections there are. Would that solve your use case @tcrasset?

@enocom enocom reopened this Jan 12, 2024
@tcrasset

Yes, that would solve it! @enocom

@tcrasset

I ended up using a preStop hook with the 2.8.1-bullseye image, which gives me access to a shell, so that I could run /bin/sh -c "sleep 20"


enocom commented Jan 19, 2024

I think we can build in support for that in the Proxy. We'll get to this.

@enocom enocom removed the priority: p0 Highest priority. Critical issue. P0 implies highest priority. label Jan 22, 2024
@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jan 22, 2024
@hessjcg hessjcg assigned hessjcg and unassigned enocom Jul 10, 2024

hessjcg commented Jul 10, 2024

I think we need a new flag. These are the existing exit behavior flags:

--max-sigterm-delay duration
--exit-zero-on-sigterm 
--quitquitquit

Let's add a flag called --min-sigterm-delay duration which makes the proxy wait a minimum number of seconds before initiating its shutdown process. The default value is 0, meaning no additional waiting. Existing proxy exit behavior does not change.

The shutdown process:

  1. Proxy process receives a TERM or INTERRUPT signal or a /quitquitquit API call
  2. Exit handling function is invoked
  3. NEW Exit handling function sleeps for ${min-sigterm-delay} seconds. By default, the delay is 0 seconds, so no sleep.
  4. Exit handling function closes all sockets listening for new connections. No new connections allowed.
  5. Connector waits for all connections to close, up to ${max-sigterm-delay} seconds.
  6. Proxy exits with appropriate exit code according to ${exit-zero-on-sigterm}
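
A compressed sketch of that sequence in Go (the structure and names are assumptions for illustration, not the actual proxy code):

```go
package shutdown

import (
	"context"
	"net"
	"os"
	"time"
)

// handleExit sketches the proposed sequence; minDelay and maxDelay
// correspond to ${min-sigterm-delay} and ${max-sigterm-delay}.
func handleExit(listeners []net.Listener, minDelay, maxDelay time.Duration, exitZero bool) {
	// Step 3 (new): keep accepting connections for min-sigterm-delay.
	time.Sleep(minDelay)

	// Step 4: close all listening sockets; no new connections allowed.
	for _, l := range listeners {
		l.Close()
	}

	// Step 5: wait up to max-sigterm-delay for open connections to close.
	ctx, cancel := context.WithTimeout(context.Background(), maxDelay)
	defer cancel()
	drained := waitForConnections(ctx)

	// Step 6: choose the exit code according to exit-zero-on-sigterm.
	if drained || exitZero {
		os.Exit(0)
	}
	os.Exit(1)
}

// waitForConnections is a placeholder for the proxy's connection
// tracking; here it simply blocks until the deadline.
func waitForConnections(ctx context.Context) bool {
	<-ctx.Done()
	return false
}
```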

hessjcg added a commit that referenced this issue Jul 11, 2024
…inimum number of seconds before shutting down the proxy. (#2266)

This makes the proxy continue to accept new connections for a period of time
after receiving the TERM signal.

Fixes #1640