Zero-downtime deployments #194
Comments
There should now be RAM capacity for this.
Node.js uses both SIGUSR1 and SIGUSR2 (https://nodejs.org/api/process.html), so base our signals on Gunicorn's (https://docs.gunicorn.org/en/stable/signals.html).
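A minimal sketch of what a Gunicorn-style toggle could look like. SIGHUP is used purely as a placeholder; the actual signal choice, the deploymentWindow flag, and whatever consumes it are assumptions, not anything decided here.

```ts
// Placeholder signal handler: SIGHUP stands in for whichever Gunicorn-style
// signal is eventually chosen, since SIGUSR1/SIGUSR2 are already spoken for
// per the linked Node.js docs. The deploymentWindow flag is hypothetical.
let deploymentWindow = false;

process.on("SIGHUP", () => {
  deploymentWindow = !deploymentWindow;
  console.log(`Deployment lock window ${deploymentWindow ? "opened" : "closed"}`);
});
```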
The signal mechanism is insufficient in the case of container restarts, since there will be nothing to signal the EventLocker off. Add a timeout from boot instead.
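A sketch of that timeout fallback. The window length, the eventLocker handle, and its disable() method are all hypothetical names, not existing Bastion code.

```ts
// Hypothetical handle to the lock manager; provided elsewhere in the bot.
interface EventLocker {
  disable(): void;
}
declare const eventLocker: EventLocker;

// Assumed upper bound on how long a deployment window can last.
const LOCK_WINDOW_MS = 5 * 60 * 1000;

// A restarted container may never receive a "deployment finished" signal,
// so stop taking per-event locks automatically once the window has passed.
const bootTimeout = setTimeout(() => eventLocker.disable(), LOCK_WINDOW_MS);
bootTimeout.unref(); // don't keep the process alive just for this timer
```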
https://github.com/Wowu/docker-rollout is a new script that's very well done. It can be incorporated to start the new container before removing the old one under Compose, without switching to a single-node Swarm. Installation should be version-pinned with a checksum, like …
(Optional) Add a third CLI parameter to specify the lock database location. Its presence enables EventLocker. This allows specifying a separate tmpfs, so there's never a disk write. To share the lock database between Docker containers, we can't use Docker's …
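A sketch of how such a parameter could gate EventLocker. The argument position and the createEventLocker factory are assumptions (see the lock manager sketch further down), not Bastion's actual CLI.

```ts
// Hypothetical factory, e.g. the createEventLocker sketch under "Lock manager".
declare function createEventLocker(dbPath: string): unknown;

// Assumed to be the third positional CLI parameter after the existing two.
const lockDbPath: string | undefined = process.argv[4];

// Only construct EventLocker when a path is given; pointing it at a file on a
// shared tmpfs keeps the lock database entirely off disk.
const eventLocker = lockDbPath ? createEventLocker(lockDbPath) : null;
```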
To properly assess the state of the containers, a healthcheck is needed. Create an HTTP healthcheck using the built-in …
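Assuming the truncated reference is to Node's built-in http module, a minimal sketch could look like the following. The port and the readiness criterion are assumptions.

```ts
import { createServer } from "node:http";
import type { Client } from "discord.js";

// Report healthy only once the client is logged in, so the deployment system
// (or a Docker HEALTHCHECK) can tell when the new container is ready.
export function startHealthcheck(client: Client, port = 3000): void {
  const server = createServer((_req, res) => {
    const ready = client.isReady();
    res.writeHead(ready ? 200 : 503, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ ready, ping: client.ws.ping }));
  });
  server.listen(port);
}
```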
Alternate concept based on https://github.com/meister03/discord-hybrid-sharding: add a CLI parameter that starts the bot in standby, with the current timestamp as the value. When the bot is started in standby, only a base set of event listeners is registered (warn, error, shard*, ready): https://github.com/DawnbrandBots/bastion-bot/blob/master/src/bot.ts Since the bot starts in standby, the new instance will not handle events while the old instance is still up. Once the ready event is emitted, the deployment system can detect this and simultaneously signal the new bot to become active while shutting down the old bot. The new bot becomes active by registering all remaining event listeners (guildCreate, guildDelete, and the listeners array containing interaction, messageCreate, messageDelete). At that point, the takeover and deployment are complete. Should the bot process crash or otherwise be restarted, it can compare its start time with the timestamp in the CLI parameter. If the delay is too high, it knows it has been restarted and can start up in active mode instead of hanging in standby when there is no deployment system.
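A sketch of that standby flow, under the assumptions that the flag is called --standby, the activation signal is SIGHUP, and the restart cutoff is one minute; none of these are fixed by the comment above, and the listener bodies are placeholders.

```ts
import { Client, Events, GatewayIntentBits } from "discord.js";

const STANDBY_TIMEOUT_MS = 60_000; // assumed cutoff for "this start was part of a deployment"

const client = new Client({
  intents: [GatewayIntentBits.Guilds, GatewayIntentBits.GuildMessages],
});

// Base listeners that are always safe to register (no user-visible responses).
function registerBaseListeners(): void {
  client.on(Events.Warn, console.warn);
  client.on(Events.Error, console.error);
  client.on(Events.ShardReady, id => console.log(`Shard ${id} ready`));
  client.once(Events.ClientReady, () => console.log("All shards connected"));
}

// Remaining listeners: guildCreate, guildDelete, interactions, messages, etc.
function registerRemainingListeners(): void {
  client.on(Events.InteractionCreate, interaction => {
    console.log(`Handling interaction ${interaction.id}`); // dispatch to command handlers here
  });
}

const flagIndex = process.argv.indexOf("--standby");
const deployStartedAt = flagIndex >= 0 ? Number(process.argv[flagIndex + 1]) : NaN;

registerBaseListeners();

if (Number.isFinite(deployStartedAt) && Date.now() - deployStartedAt < STANDBY_TIMEOUT_MS) {
  // Started by the deployment system: wait for its signal before going active.
  process.once("SIGHUP", registerRemainingListeners);
} else {
  // Stale or missing timestamp (e.g. a crash restart): go active immediately.
  registerRemainingListeners();
}

void client.login(process.env.DISCORD_TOKEN);
```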
A new release to the live Bastion could cause up to a minute of downtime, rounding up, due to stopping the bot process and then starting a new process, which reconnects shard-by-shard to the Discord gateway to avoid being rate-limited. To have no downtime, deployments need to start the new process first, ensure that no duplicate responses happen during the deployment period while two processes are running, then stop the old process once the new process has connected all shards to the gateway. Two things must be implemented for this to happen: container start before stop, and a lock manager.
Container start before stop
In Swarm, this is configured as deploy.update_config.order: start-first (docs). This is not supported by Compose v1 or v2, so to use it, production must be switched to a single-node Swarm. (We are not yet at the scale where additional benefits are reaped from separating shards into their own processes.) A pure Compose solution could be to use a different project name for each deployment, as long as previous project names are tracked so that the old stack can be taken down.
Lock manager
Since Bastion is not yet at the scale for sharding across multiple hosts, the fastest solution should be SQLite in write-ahead-log (WAL) mode. Before processing a message or interaction, attempt to INSERT its snowflake into a table. Continue only if this succeeds, as we then hold the lock; if it fails, a different process has taken the lock. In the general case, this kind of overhead could also help with Discord's eventual consistency (receiving the same event N times), though that has never been a problem in practice. The overhead could also be limited to the deployment window by toggling it upon receiving a certain Unix signal.
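A sketch of this lock, assuming the better-sqlite3 binding (Bastion may use a different SQLite library); the table name, column name, and factory name are illustrative only.

```ts
import Database from "better-sqlite3";

export function createEventLocker(dbPath: string) {
  const db = new Database(dbPath);
  db.pragma("journal_mode = WAL"); // write-ahead log so two processes can write concurrently
  db.exec("CREATE TABLE IF NOT EXISTS locks (snowflake TEXT PRIMARY KEY)");
  const insert = db.prepare("INSERT INTO locks (snowflake) VALUES (?)");

  return {
    // Returns true if this process won the lock for the event and should handle it.
    acquire(snowflake: string): boolean {
      try {
        insert.run(snowflake);
        return true;
      } catch {
        // UNIQUE constraint violation: the other process already claimed this event.
        return false;
      }
    },
  };
}
```

Handlers would then bail out early with something like if (!eventLocker.acquire(interaction.id)) return; before doing any work.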
Caveats
VM memory demands increase, since the VM must support two Node.js bot containers running during the deployment window. The addition of <> card search has already increased memory demands.
In general (not just the zero-downtime case), how button timeouts behave across a redeployment or a bot restart should be considered.