fix: Fix systemd race condition when restarting mender.
Although somewhat rare, the race seems to have more than a 50% chance of
happening at least once in one of the state script tests of an
integration test run. What happens is that when running `systemctl
restart mender-updated` from an `ArtifactReboot` script, systemd kills
the whole control group, including the script. This is fine in itself,
but if the script happens to terminate before Mender does, then it
will be recorded as an error, and Mender will start on its error
path. What happens afterwards depends on how far it gets before it is
also killed. Usually it will not get further than executing the first
`ArtifactReboot_Error` script, but it could potentially go all the way
to a rollback. Either outcome is wrong.
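To see why the client cannot tell this situation apart from a genuine script failure, here is a minimal Python sketch (not Mender's actual code) of a parent process whose child is terminated by SIGTERM. The sleeping child is a hypothetical stand-in for the `ArtifactReboot` script, and signalling it directly imitates only the part of systemd's control-group kill that reaches the script. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the ArtifactReboot script: a child process
# that would normally run for a while (here it just sleeps).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

# Imitate systemd killing the whole control group: the script receives
# SIGTERM while the parent (playing the client) is still running.
child.send_signal(signal.SIGTERM)

# The parent only observes an abnormal exit. Python reports "killed by
# signal N" as a negative return code, so this is -signal.SIGTERM.
rc = child.wait(timeout=10)
print(rc)
```

On Linux this prints `-15`. A parent that treats any abnormal exit of a state script as a failure will, as described above, take its error path even though the script did nothing wrong.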

The issue won't affect users of `rootfs-image`, since it uses
`NeedsArtifactReboot=Automatic`, which doesn't call the update
module's `ArtifactReboot`, but it can affect other ways of implementing
`ArtifactReboot`, such as restarting Mender with `systemctl` after a
package upgrade.

The best way to mitigate this is to make sure the script outlives
Mender. This can be done in the script itself with a shell `trap` or
similar, since systemd sends SIGTERM first. But to make this less
surprising for users, switch systemd to `KillMode=mixed`, so that it
sends SIGTERM only to the client in all cases, leaving scripts to be
killed only by the subsequent SIGKILL.
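The effect of the switch can be summarized with a toy Python model (not systemd or Mender code). It follows the `KillMode=` descriptions in systemd.kill(5): with the default `control-group`, SIGTERM goes to every process in the unit's control group at once; with `mixed`, SIGTERM goes only to the main process and the remaining processes receive the subsequent SIGKILL. The function names are illustrative, and the exit delays passed in are hypothetical inputs, not measurements.

```python
def script_exits_first(kill_mode: str, client_exit_ms: int, script_exit_ms: int) -> bool:
    """Return True if the ArtifactReboot script would exit before the client.

    client_exit_ms and script_exit_ms are hypothetical times each process
    needs to exit after it receives SIGTERM.
    """
    if kill_mode == "control-group":
        # Both processes get SIGTERM at the same moment, so the faster one
        # exits first. This is the race described in the commit.
        return script_exits_before(script_exit_ms, client_exit_ms)
    if kill_mode == "mixed":
        # Only the client gets SIGTERM; the script is left for the later
        # SIGKILL, so the client never sees it die while shutting down normally.
        return False
    raise ValueError(f"KillMode not covered by this model: {kill_mode}")


def script_exits_before(script_exit_ms: int, client_exit_ms: int) -> bool:
    """Strict ordering: a tie is not treated as the script winning."""
    return script_exit_ms < client_exit_ms


def recorded_outcome(kill_mode: str, client_exit_ms: int, script_exit_ms: int) -> str:
    """What the client would act on while shutting down (simplified)."""
    if script_exits_first(kill_mode, client_exit_ms, script_exit_ms):
        # The script's death looks like a failure, so the error path is taken.
        return "ArtifactReboot_Error"
    return "clean shutdown"


# A script that reacts to SIGTERM faster than the client triggers the race
# under the default mode, but not under KillMode=mixed.
print(recorded_outcome("control-group", client_exit_ms=50, script_exit_ms=5))
print(recorded_outcome("mixed", client_exit_ms=50, script_exit_ms=5))
```

This prints `ArtifactReboot_Error` followed by `clean shutdown`. The model deliberately ignores the stop timeout: if the client hangs past it, both processes are killed with SIGKILL and the client records nothing at all.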

This started appearing with Yocto scarthgap. Why it has appeared now
is anyone's guess; there could be several reasons:
* The exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group
  slightly differently.
Whatever the reason, the script sometimes terminates before Mender,
which triggers the issue.

Changelog: Fix systemd race condition when restarting Mender from an
`ArtifactReboot` script. The symptom would be an error message like:
```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```
And the `ArtifactReboot_Error` state scripts would be executed, even
though they should not.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
kacf committed Aug 9, 2024
1 parent 949a7a9 commit 5e553cf
Showing 1 changed file with 1 addition and 0 deletions.

support/mender-updated.service
```diff
@@ -11,6 +11,7 @@ Group=root
 ExecStart=/usr/bin/mender-update daemon
 Restart=always
 WatchdogSec=86400
+KillMode=mixed
 
 [Install]
 WantedBy=multi-user.target
```
