integrate warp-systemd for more reliability in production deployments #2027

Merged: 2 commits merged into master from warp-systemd on Jan 24, 2025

Conversation

mpscholten (Member) commented Jan 23, 2025

mpscholten merged commit 2f25258 into master on Jan 24, 2025
2 checks passed
mpscholten deleted the warp-systemd branch on January 24, 2025 at 01:00
amitaibu (Collaborator)

@mpscholten Thanks. I think I don't fully understand how it works 😅

  1. How would you use this? (It seems there's an env var that's disabled by default.)
  2. The watchdog seems to be checking a local URL (http://127.0.0.1) - is that correct?
  3. Finally, what happens if the healthcheck fails?

mpscholten (Member, Author)

The cool thing about this change is that everyone using ihp.nixosModules.app with deploy-to-nixos, e.g. like this:

flake.nixosConfigurations."netcup-customerlink" = nixpkgs.lib.nixosSystem {
    system = "aarch64-linux";
    specialArgs = inputs;
    modules = [
        ihp.nixosModules.app
        # ...
    ];
};

will have this enabled by default. It's disabled by default in the Haskell part, but in the NixOS part we set IHP_SYSTEMD = "1";, so when the app is deployed via NixOS it's enabled by default. When the IHP app is not deployed that way, it needs to be configured manually by registering a systemd socket and setting up the systemd watchdog.
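
For illustration, this is roughly how that switch looks on the NixOS side; the unit name app is just a placeholder, not necessarily what ihp.nixosModules.app actually names the service (the socket and watchdog pieces are sketched further below):

# Sketch only: opt the app into the warp-systemd code path via the env var.
systemd.services.app.environment.IHP_SYSTEMD = "1";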

The local URL is correct. The app is basically pinging itself on localhost to make sure that it's still running. I dealt with a crash earlier today where the warp web server got stuck (I think it's the same issue described in https://blog.cachix.org/posts/2020-12-23-post-mortem-recent-downtime/). This would have been prevented by such a check. (We cannot use the actual host/domain of the server, because sometimes the DNS records are not yet updated, and in that case we don't want the server to end up in a restart loop.)

When the health check fails, the app will be restarted. systemd is now configured to expect a heartbeat every 60 seconds, while the app delivers a heartbeat every 30 seconds. So if the app gets stuck, it will be restarted after at most 60 seconds.
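
A rough sketch of the watchdog side of that in NixOS terms; the unit name app and the exact values are placeholders, not necessarily what the module sets:

# Sketch only: systemd restarts the unit if it receives no heartbeat for 60s.
# The running app sends the heartbeat itself (sd_notify WATCHDOG=1) every 30s.
systemd.services.app.serviceConfig = {
    WatchdogSec = 60;
    Restart = "always";
};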

There are also some other hidden gems in this change: we now use systemd's socket activation. Basically, systemd now listens on the port when the server boots. And only when the first request comes in, systemd will start the IHP server and pass the socket to the server. This also comes in handy when e.g. restarting the server: during the restart, any incoming HTTP requests are queued by systemd, and once the restart has finished, the IHP app picks up the queued requests. Previously, when the IHP app was restarted, there was a small window where the server was unreachable.
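
And a sketch of the socket-activation piece; again the unit name and port are placeholders:

# Sketch only: systemd owns the listening socket and hands the fd to the
# service, so incoming connections queue here while the service restarts.
systemd.sockets.app = {
    wantedBy = [ "sockets.target" ];
    socketConfig.ListenStream = "8000";
};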

amitaibu (Collaborator)

Thanks, that's cool. I'll create a PR to add some info to the docs 😸

One thing I didn't understand:

And only when the first request comes in, systemd will start the IHP server and pass the socket to the server.

What's the advantage here? Sounds like the first request will take more time to get a response, no?

mpscholten (Member, Author)

thanks 🙌

I think I was not fully correct in my previous statement. The lazy-start behavior is only what systemd does in general; the IHP service is configured to start at system boot anyway, so in our case there is no delay. The real advantage is the zero-downtime restarts.
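
In NixOS terms that just means the service itself is also wanted at boot, not only its socket; sketch with a placeholder unit name:

# Sketch only: start the service at boot instead of waiting for the first connection.
systemd.services.app.wantedBy = [ "multi-user.target" ];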

mpscholten (Member, Author)

http://0pointer.de/blog/projects/socket-activation.html is a good resource on the topic (some parts go a bit too deep into the details, but you get the gist).

amitaibu added a commit that referenced this pull request Jan 24, 2025