4.0.x cherry-picks #1655
Conversation
Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 660eeb8)
The problem is that once we have entered the `sync_leave_download` internal state, a deployment has started, and a deployment always has to be ended in order to clean up properly. This is fine when `sync_leave_download` is successful and proceeds into the subsequent download states, which always end with the `end_of_deployment_state`. But if `sync_leave_download` fails, it goes directly to `sync_error`, which does not know about the deployment and does not go to `end_of_deployment_state`. Fix this by creating a dedicated `sync_error_download` state.

This bug happened only occasionally in the test `test_state_scripts[Corrupted_script_version_in_etc-test_set11]` in the integration repository, typically under load. The reason it happened only occasionally is that the test is programmed to fail in `Sync_Leave` only once, and inventory submission always runs first. So by the time it got to the deployment, `Sync_Leave` was always returning success. But under load the inventory submission could fail, which would run `Sync_Error` instead and skip `Sync_Leave`, leaving the programmed failure around for the upcoming deployment, where the bug would then occur.

Testing this in unit tests requires supporting more than one deployment run, which requires an extra member in the exit state. In the spirit of limiting space requirements for embedded systems, I've made this member, which is only used for tests, debug-only.

Changelog: Fix crash when `Sync_Leave` returns error during a deployment. The error message would be:

```
State machine event DeploymentStarted was not handled by any transition
```

and would happen on the next deployment following the `Sync_Leave` error. With a long polling interval, this could cause the bug to be latent for quite a while.

Ticket: MEN-7379

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 949a7a9)
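For illustration, here is a minimal, self-contained C++ sketch of the state-machine idea described above. The names (`Context`, `SyncLeaveDownload`, `SyncErrorDownload`, `deployment_active`) are hypothetical and are not taken from the Mender sources; the sketch only shows why a download-specific error state is needed: the generic error state never ends the deployment that `sync_leave_download` started, which is what left the next `DeploymentStarted` event unhandled.

```cpp
#include <iostream>

struct Context {
	// Stand-in for the deployment tracking that must be cleaned up by
	// the equivalent of end_of_deployment_state.
	bool deployment_active {false};
};

// Generic error state: knows nothing about deployments, so it cannot
// end one.
void SyncError(Context &) {
	std::cout << "SyncError: handling generic error\n";
}

// Download-specific error state: also ends the deployment it belongs to.
void SyncErrorDownload(Context &ctx) {
	std::cout << "SyncErrorDownload: handling error during download\n";
	ctx.deployment_active = false;  // the missing cleanup
}

void SyncLeaveDownload(Context &ctx, bool script_ok) {
	ctx.deployment_active = true;  // a deployment has now started
	if (script_ok) {
		// ... continue through the download states, which always
		// finish by ending the deployment ...
		ctx.deployment_active = false;
	} else {
		// Before the fix this went to SyncError instead, leaving
		// deployment_active == true and breaking the next deployment.
		SyncErrorDownload(ctx);
	}
}

int main() {
	Context ctx;
	SyncLeaveDownload(ctx, /*script_ok=*/false);
	std::cout << "deployment_active after error: "
	          << std::boolalpha << ctx.deployment_active << "\n";
}
```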
Although somewhat rare, it seems to have more than a 50% chance of happening at least once in one of the state script tests of an integration test run.

What's happening is that when running `systemctl restart mender-updated` from an `ArtifactReboot` script, systemd kills the whole control group, including the script. This is fine in itself, but if the script happens to terminate before Mender does, the termination is recorded as an error, and Mender starts on its error path. What happens afterwards depends on how far it gets before it is also killed. Usually it will not get further than executing the first `ArtifactReboot_Error` script, but it could potentially go all the way to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses `NeedsArtifactReboot=Automatic`, which doesn't call the update module's `ArtifactReboot`, but it can affect other means of running `ArtifactReboot`, such as restarting the client with systemctl after a package upgrade.

The best way to mitigate this is to make sure the script survives longer than Mender. This can be done in the script itself with a shell `trap` or similar, since systemd sends SIGTERM first. But in order to make this less surprising for users, switch systemd to kill the client first in all cases, leaving scripts to be killed only if the termination times out and systemd has to resort to SIGKILL.

This started appearing with Yocto scarthgap, and why it has appeared now is anyone's guess; there could be multiple reasons:

* The exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group slightly differently.

Whatever the reason, it causes the script to sometimes terminate before Mender, causing the issue.

Changelog: Fix systemd race condition when restarting mender from `ArtifactReboot` script. The symptom would be an error message like:

```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```

and the `ArtifactReboot_Error` state scripts would be executed, even though they should not be.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 5e553cf)
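As a hedged illustration of the kill-order behavior described above (not necessarily the exact change made in this commit), a systemd drop-in with `KillMode=mixed` gives the documented semantics of sending SIGTERM only to the unit's main process, while the rest of the control group, including a running state script, only receives SIGKILL if the stop operation times out:

```
# /etc/systemd/system/mender-updated.service.d/override.conf
# Hypothetical drop-in path; adjust to the unit name actually in use.
[Service]
# SIGTERM goes only to the main process (the Mender client); remaining
# processes in the control group are killed with SIGKILL only if the
# stop operation exceeds TimeoutStopSec.
KillMode=mixed
```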
@kacf, let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".

My commands and options:

* You can trigger a pipeline on multiple PRs with:
* You can start a fast pipeline, disabling full integration tests, with:
* You can trigger GitHub->GitLab branch sync with:
* You can cherry-pick to a given branch or branches with:
Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 06d08c0)
Merging these commits will result in the following changelog entries:

Changelogs

mender (4.0.x)

New changes in mender since 4.0.x:

Bug Fixes

* Fix crash when `Sync_Leave` returns error during a deployment.
* Fix systemd race condition when restarting mender from `ArtifactReboot` script.
Spurious failure. Merging.
No description provided.