Commit
fix: Fix systemd race condition when restarting mender.
Although rare in any single test, it seems to have a more than 50% chance of happening at least once across the state script tests of an integration test run.

What's happening is that when running `systemctl restart mender-updated` from an `ArtifactReboot` script, systemd kills the whole control group, including the script. This is fine in itself, but if the script happens to terminate before Mender does, then it will be recorded as an error, and Mender will start on its error path. What happens afterwards depends on how far it gets before it is also killed. Usually it will not get further than executing the first `ArtifactReboot_Error` script, but it could potentially go all the way to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses `NeedsArtifactReboot=Automatic`, which doesn't call the update module's `ArtifactReboot`, but it can affect other means of running `ArtifactReboot`, such as restarting Mender with systemctl after a package upgrade.

The best way to mitigate this is to make sure the script survives longer than Mender. This can be done in the script itself with a shell `trap` or similar, since systemd sends SIGTERM first. But in order to make this less surprising for users, switch systemd to kill the client first in all cases, leaving scripts to be killed only if the termination times out and it has to resort to SIGKILL.

This started appearing with Yocto scarthgap. Why it has appeared now is anyone's guess; there could be multiple reasons:

* Exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group slightly differently.

Whatever the reason, it causes the script to sometimes terminate before Mender, causing the issue.

Changelog: Fix systemd race condition when restarting mender from `ArtifactReboot` script.
The symptom would be an error message like:
```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```
And the `ArtifactReboot_Error` state scripts would be executed, even though they should not.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
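
For illustration only, the script-side `trap` mitigation mentioned in the message could look roughly like the sketch below. It is not taken from this commit. The function names `ignore_term` and `restart_client` are hypothetical, and the restart command is passed in as arguments so the sketch can be tried without systemd.

```shell
#!/bin/sh
# Minimal sketch of a script that survives the initial SIGTERM, assuming
# systemd signals every process in the control group with SIGTERM first.

# Ignore SIGTERM in this shell. An ignored signal stays ignored in child
# processes, so the command started below is not terminated by it either.
ignore_term() {
    trap '' TERM
}

# Hypothetical helper: ignore SIGTERM, then run the given restart command.
restart_client() {
    ignore_term
    "$@"
}

# Assumed usage inside a real script (not executed here):
#   restart_client systemctl restart mender-updated
```

With SIGTERM ignored, the script can still be ended by SIGKILL, so this only changes which process exits first; it does not make the script unkillable.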