integrate warp-systemd for more reliability in production deployments #2027

Merged: 2 commits merged into master from warp-systemd on Jan 24, 2025

Conversation

mpscholten (Member) commented Jan 23, 2025

mpscholten merged commit 2f25258 into master on Jan 24, 2025
2 checks passed
mpscholten deleted the warp-systemd branch on January 24, 2025 at 01:00
amitaibu (Collaborator)

@mpscholten Thanks. I think I don't fully understand how it works 😅

  1. How would you use this? (It seems there's an env var that's disabled by default.)
  2. The watchdog seems to be checking a local URL (http://127.0.0.1) - is that correct?
  3. Finally, what happens if the healthcheck fails?

mpscholten (Member, Author)

The cool thing about this change is that everyone using ihp.nixosModules.app with deploy-to-nixos, e.g. like this:

flake.nixosConfigurations."netcup-customerlink" = nixpkgs.lib.nixosSystem {
    system = "aarch64-linux";
    specialArgs = inputs;
    modules = [
        ihp.nixosModules.app
        # ...
    ];
};

will have this enabled by default. It's disabled by default in the Haskell part, but in the NixOS part we set IHP_SYSTEMD = "1";, so when the app is deployed via NixOS it's enabled by default. When the IHP app is not deployed that way, it needs to be configured manually by registering a systemd socket and setting up the systemd watchdog.
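
For illustration, this is roughly how that switch looks on the NixOS side; the unit name app is just a placeholder, not necessarily what ihp.nixosModules.app actually names the service (the socket and watchdog pieces are sketched further below):

# Sketch only: opt the app into the warp-systemd code path via the env var.
systemd.services.app.environment.IHP_SYSTEMD = "1";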

The local URL is correct. The app is basically pinging itself on localhost to make sure that it's still running. I dealt with a crash earlier today where the warp web server got stuck (I think it's the same issue described in https://blog.cachix.org/posts/2020-12-23-post-mortem-recent-downtime/). This would have been prevented by such a check. (We cannot use the actual host/domain of the server, because sometimes the DNS records are not yet updated, and in that case we don't want the server to end up in a restart loop.)

When the health check fails, the app will be restarted. systemd is now configured to expect a heartbeat every 60 seconds, while the app delivers a heartbeat every 30 seconds. So if the app gets stuck, it will be restarted after at most 60 seconds.
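
A rough sketch of the watchdog side of that in NixOS terms; the unit name app and the exact values are placeholders, not necessarily what the module sets:

# Sketch only: systemd restarts the unit if it receives no heartbeat for 60s.
# The running app sends the heartbeat itself (sd_notify WATCHDOG=1) every 30s.
systemd.services.app.serviceConfig = {
    WatchdogSec = 60;
    Restart = "always";
};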

There are also some other hidden gems in this change: we now use systemd's socket activation. Basically, systemd now listens on the port when the server boots. And only when the first request comes in, systemd will start the IHP server and pass the socket to the server. This also comes in handy when e.g. restarting the server: during the restart, any incoming HTTP requests are queued by systemd, and once the restart has finished, the IHP app picks up the queued requests. Previously, when the IHP app was restarted, there was a small window where the server was unreachable.
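
And a sketch of the socket-activation piece; again the unit name and port are placeholders:

# Sketch only: systemd owns the listening socket and hands the fd to the
# service, so incoming connections queue here while the service restarts.
systemd.sockets.app = {
    wantedBy = [ "sockets.target" ];
    socketConfig.ListenStream = "8000";
};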

amitaibu (Collaborator)

Thanks, that's cool. I'll create a PR to add some info to the docs 😸

One thing I didn't understand:

And only when the first request comes in, systemd will start the IHP server and pass the socket to the server.

What's the advantage here? Sounds like the first request will take more time to get a response, no?

mpscholten (Member, Author)

thanks 🙌

I think I was not fully correct in my previous statement. The lazy-start behavior is only what systemd does in general; the IHP service is configured to start at system boot anyway, so in our case there is no delay. The real advantage is the zero-downtime restarts.
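
In NixOS terms that just means the service itself is also wanted at boot, not only its socket; sketch with a placeholder unit name:

# Sketch only: start the service at boot instead of waiting for the first connection.
systemd.services.app.wantedBy = [ "multi-user.target" ];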

mpscholten (Member, Author)

http://0pointer.de/blog/projects/socket-activation.html is a good resource on the topic (some parts go a bit too deep into the details, but you get the gist).

amitaibu added a commit that referenced this pull request Jan 24, 2025