
Support forcing the proxy to accept new connections while max-sigterm-delay is in effect #1640

Closed
red8888 opened this issue Feb 8, 2023 · 14 comments · Fixed by #1805, #2266 or #2262

red8888 commented Feb 8, 2023

Problem Case

Consider the following scenario when running the proxy as a sidecar in k8s:

  1. term_timeout or max-sigterm-delay is set for the proxy container to 5 minutes
  2. Pod scales down
  3. The app container AND the proxy get a sigterm (because they are in the same pod)
  4. App's cleanup routine requires it to make new connections to the DB after getting a sigterm
  5. The proxy will refuse new connections after getting a sigterm and will shut down early if there are no open connections (which, for this app, will occasionally be the case)
  6. This race condition shuts down the proxy before the app can access the DB and causes the app to fail to finish its sigterm cleanup routine

It doesn't matter what you set the term_timeout to, because the app needs to initiate new connections after both it and the proxy get a sigterm.

Alternatives Considered

A hacky script entrypoint command that swallows the sigterm and forces the proxy to remain up for a static amount of time, to let the app container call the new quitquitquit endpoint.

Proposed Solution

What I suggest is this: the new quitquitquit endpoint supports an idle_timeout setting in the request. So the process could work like this:

  1. term_timeout or max-sigterm-delay is set for the proxy process to 5 minutes
  2. Pod scales down
  3. The app container AND the proxy get a sigterm (because they are in the same pod)
  4. App's cleanup routine immediately hits the quitquitquit endpoint of the proxy and sets idle_timeout to 30 seconds
  5. term_timeout is still in effect but the proxy will allow new connections for 30 seconds
  6. The app is now able to make a new DB connection to finish its cleanup routine
  7. After 30 secs there are no new connections and the proxy shuts down early
  8. Yay

Would it be easy to implement an arbitrary user-set timer in the quitquitquit method?
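
For illustration, a minimal sketch of what the app's cleanup routine might send under this proposal. The idle_timeout parameter is the proposed addition and does not exist today; the form encoding and the admin port (9091 by default) are assumptions:

```go
package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical: idle_timeout is the *proposed* parameter, not an
	// existing one; the form encoding and port are assumptions.
	form := url.Values{"idle_timeout": {"30s"}}
	resp, err := http.Post(
		"http://localhost:9091/quitquitquit",
		"application/x-www-form-urlencoded",
		strings.NewReader(form.Encode()),
	)
	if err != nil {
		log.Fatalf("quitquitquit request failed: %v", err)
	}
	resp.Body.Close()
}
```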

@red8888 red8888 added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Feb 8, 2023
@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Feb 8, 2023

enocom commented Feb 9, 2023

I'm curious about this:

> App's cleanup routine requires it to make new connections to the DB after getting a sigterm

Why is your app making a new database connection? Could you just re-use an existing connection with a connection pool for instance?

@GuillaumeCisco

Python Celery workers open and close new connections when processing jobs.
So setting max-sigterm-delay does not help here.

Would it be possible to keep the cloudsql proxy alive until the main container dies?
I've seen the trap thing on sigterm, but is there a less hacky way?

Maybe I'm missing something about configuring the way celery workers connect to cloudsql proxy with a connection pool. But it does not seem intuitive...


enocom commented Apr 26, 2023

Have you tried using the /quitquitquit endpoint? That's intended specifically for the case where you want to control the shutdown outside of K8s's usual lifecycle.
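
For reference, a minimal sketch of calling that endpoint from the app container, assuming the proxy was started with --quitquitquit and its admin server is listening on the default port 9091:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Tell the sidecar proxy to shut down once the app's cleanup is done.
	resp, err := http.Post("http://localhost:9091/quitquitquit", "", nil)
	if err != nil {
		log.Printf("failed to reach the proxy's admin server: %v", err)
		return
	}
	resp.Body.Close()
}
```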

@JamesMcNee

Hey @enocom -- Also encountering this as a change in behaviour between v1 and v2.

When using term_timeout in V1, new connections could still be established as long as there was an active connection. With V2 and max-sigterm-delay, though, new connections are rejected even when an existing connection is active.

The main thing I can see this impacting is apps that are using connection pools as they will not be able to lease new connections, which they would have previously.


GuillaumeCisco commented Apr 27, 2023

> Have you tried using the /quitquitquit endpoint? That's intended specifically for the case where you want to control the shutdown outside of K8s's usual lifecycle.

I don't want to try to kill the cloudsql proxy from another container right now, but maybe that would be a solution to explore.
After a lot of errors and retries, I succeeded in setting up pgbouncer (which is a connection pooler) on kubernetes as a sidecar for my celery workers.
I even created a repository to show how to do this: https://github.com/GuillaumeCisco/testpgbouncer

So I now have cloudsql proxy <- pgbouncer <- celery worker.
cloudsql proxy has max-sigterm-delay set to 30s, the same as the default graceful period.

Now what happens is that all containers receive a SIGTERM.
First, pgbouncer is killed: for pgbouncer, SIGTERM does a hard kill, unlike SIGINT, which does a soft kill (it issues PAUSE, accepting no more connections and waiting for active connections to close). Either way, no new connections are accepted even if some are still active.
It looks like only session pooling could do the trick, but not the other ones.
See the documentation: https://www.pgbouncer.org/usage.html

> Session pooling
> Most polite method. When a client connects, a server connection will be assigned to it for the whole duration the client stays connected. When the client disconnects, the server connection will be put back into the pool. This is the default method.
>
> Transaction pooling
> A server connection is assigned to a client only during a transaction. When PgBouncer notices that transaction is over, the server connection will be put back into the pool.
>
> Statement pooling
> Most aggressive method. The server connection will be put back into the pool immediately after a query completes. Multi-statement transactions are disallowed in this mode as they would break.
>
> SIGINT
> Safe shutdown. Same as issuing PAUSE and SHUTDOWN on the console.
>
> SIGTERM
> Immediate shutdown. Same as issuing SHUTDOWN on the console.
>
> PAUSE [db]
> PgBouncer tries to disconnect from all servers, first waiting for all queries to complete. The command will not return before all queries are finished. To be used at the time of database restart.

Then cloudsql-proxy is closed, and my celery worker throws errors, as it is still in its graceful shutdown and needs to access the database to store its last results before quitting.
That bumps me back to the first issue again.

What is needed is for the "connection pool / cloudsql proxy" containers to be shut down after the graceful period of the main container (the celery worker in my case). This means potentially accepting new connections to the connection pool.
So I don't really understand how a connection pool can help here. Am I missing a fundamental piece of understanding?

So maybe using the trap SIGTERM thing could help here for the connection pool part, but pgbouncer does not implement a quitquitquit endpoint I could use to send a new SIGTERM/SIGKILL when my worker finishes its graceful shutdown. So I'm still a bit confused about how to handle this issue.

Maybe simply use the trap SIGTERM solution with a sleep of 30s? That is not optimal.
Or use a preStop hook with a SIGINT and a sleep?

Thoughts?


enocom commented Apr 27, 2023

In theory, we could change the behavior to accept new connections when the max-sigterm-delay is in effect.

I think it would amount to moving these lines down below the timer.
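
Roughly, a sketch of that change (not the actual proxy source; the names and structure here are assumptions for illustration):

```go
package shutdown

import (
	"net"
	"time"
)

// onSIGTERM sketches the proposed behavior: instead of closing the
// listeners as soon as SIGTERM arrives, keep accepting new connections
// until max-sigterm-delay elapses. Simplified — the real proxy would
// also exit early once all open connections have drained.
func onSIGTERM(listeners []net.Listener, maxDelay time.Duration) {
	// Old behavior: the listeners were closed here, refusing new
	// connections for the entire delay window.
	time.Sleep(maxDelay) // keep serving during the delay
	for _, l := range listeners {
		l.Close() // only now stop accepting new connections
	}
}
```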

@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Apr 27, 2023
@plaflamme

Please see this post on the topic, which refers to a Twitter thread from Tim Hockin that basically says that (at least within k8s) a pod/container should always keep accepting new connections, even after a SIGTERM.

I personally found this very counterintuitive, and that thread clarified things quite well: basically, shutting down gracefully is hard in k8s, so you can't really assume anything.

As for this particular container, since it can run in varying contexts, perhaps this behaviour should be a flag, e.g.: --tim-says-keep-serving


enocom commented May 19, 2023

Thanks for the links @plaflamme. I'm convinced we can just fix this.

@enocom enocom added priority: p0 Highest priority. Critical issue. P0 implies highest priority. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels May 19, 2023
@enocom enocom changed the title Support forcing the proxy to accept new connections while term_timeout or max-sigterm-delay is in effect Support forcing the proxy to accept new connections while max-sigterm-delay is in effect May 22, 2023
enocom added a commit that referenced this issue May 22, 2023
When the proxy starts its shutdown, it will now accept additional
connections when the max-sigterm-delay flag is passed. As before, if any
connections are still open after the delay has elapsed, the Proxy will
report an error.

Fixes #1640.
enocom added a commit that referenced this issue May 23, 2023
@tcrasset

I'm having the same issues as OP, and I don't think the main concern has been addressed. At least, this is not what I'm seeing on my end. Using version 2.8.1.

I'm also using celery in a container, but not pooling connections, so I'm opening a new connection on every call to the database/proxy. I need to do some cleanup during my celery shutdown process that calls the database.

What happens is that both the celery container and the proxy container receive a SIGTERM.

  • celery container and proxy container receive a SIGTERM
  • the proxy, with --max-sigterm-delay 20s, is supposed to wait for 20 seconds, but as there are no connections, it terminates early.
  • celery performs the shutdown procedure, tries to call the database, and errors out.

Note that I only call /quitquitquit after the celery container exits.

What I need is the proxy to keep serving for a certain period of time, and not exit early if there are no connections, which I think is what @plaflamme was suggesting.


enocom commented Jan 12, 2024

I'm open to having the proxy wait the full duration regardless of how many open connections there are. Would that solve your use case @tcrasset?

@enocom enocom reopened this Jan 12, 2024
@tcrasset

Yes, that would solve it! @enocom

@tcrasset

I ended up using a preStop hook with the 2.8.1-bullseye image, which gives me access to a shell, so that I could run /bin/sh -c "sleep 20"


enocom commented Jan 19, 2024

I think we can build in support for that in the Proxy. We'll get to this.

@enocom enocom removed the priority: p0 Highest priority. Critical issue. P0 implies highest priority. label Jan 22, 2024
@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jan 22, 2024
@hessjcg hessjcg assigned hessjcg and unassigned enocom Jul 10, 2024

hessjcg commented Jul 10, 2024

I think we need a new flag. These are the existing exit behavior flags:

--max-sigterm-delay duration
--exit-zero-on-sigterm 
--quitquitquit

Let's add a flag called --min-sigterm-delay duration which makes the proxy wait a minimum number of seconds before initiating its shutdown process. The default value is 0, meaning no additional waiting. Existing proxy exit behavior does not change.

The shutdown process:

  1. Proxy process receives a TERM or INTERRUPT signal or a /quitquitquit API call
  2. Exit handling function is invoked
  3. NEW Exit handling function sleeps for ${min-sigterm-delay} seconds. By default, the delay is 0 seconds, so no sleep.
  4. Exit handling function closes all sockets listening for new connections. No new connections allowed.
  5. Connector waits for all connections to close, up to ${max-sigterm-delay} seconds.
  6. Proxy exits with appropriate exit code according to ${exit-zero-on-sigterm}
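
A compressed sketch of that sequence in Go (the structure and names are assumptions for illustration, not the actual proxy code):

```go
package shutdown

import (
	"context"
	"net"
	"os"
	"time"
)

// handleExit sketches the proposed sequence; minDelay and maxDelay
// correspond to ${min-sigterm-delay} and ${max-sigterm-delay}.
func handleExit(listeners []net.Listener, minDelay, maxDelay time.Duration, exitZero bool) {
	// Step 3 (new): keep accepting connections for min-sigterm-delay.
	time.Sleep(minDelay)

	// Step 4: close all listening sockets; no new connections allowed.
	for _, l := range listeners {
		l.Close()
	}

	// Step 5: wait up to max-sigterm-delay for open connections to close.
	ctx, cancel := context.WithTimeout(context.Background(), maxDelay)
	defer cancel()
	drained := waitForConnections(ctx)

	// Step 6: choose the exit code according to exit-zero-on-sigterm.
	if drained || exitZero {
		os.Exit(0)
	}
	os.Exit(1)
}

// waitForConnections is a placeholder for the proxy's connection
// tracking; here it simply blocks until the deadline.
func waitForConnections(ctx context.Context) bool {
	<-ctx.Done()
	return false
}
```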

hessjcg added a commit that referenced this issue Jul 11, 2024
…inimum number of seconds before shutting down the proxy. (#2266)

This makes the proxy continue to accept new connections for a period of time
after receiving the TERM signal.

Fixes #1640