fix: Fix crash when Sync_Leave returns error during a deployment. #1653
Conversation
@kacf, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".

My commands and options:
* You can trigger a pipeline on multiple PRs with:
* You can start a fast pipeline, disabling full integration tests, with:
* You can trigger GitHub->GitLab branch sync with:
* You can cherry pick to a given branch or branches with:
Being tested together with mendersoftware/integration#2640.

@mender-test-bot start pipeline --pr integration/2640
Hello 😺 I created a pipeline for you here: Pipeline-1403668934

Build Configuration Matrix
Splendid. Thanks.
This particular bug would have taken me a very long time to figure out.

I started the build again (using the bot) to run the pipeline with

Thanks! I had a "doh!" moment as soon as I saw your message! 😄
Force-pushed from db58e6b to 6a70ff1.
The problem is that once we have entered the `sync_leave_download` internal state, a deployment has started, and a deployment always has to be ended in order to clean up properly. This is fine when `sync_leave_download` is successful and proceeds into subsequent download states, which always end with the `end_of_deployment_state`. But if `sync_leave_download` fails, then it goes directly to `sync_error`, which does not know about the deployment, and does not go to `end_of_deployment_state`. Fix this by creating a dedicated `sync_error_download` state.

This bug happened only occasionally in the test `test_state_scripts[Corrupted_script_version_in_etc-test_set11]` in the integration repository, typically under load. The reason it happened only occasionally is that the test is programmed to fail in `Sync_Leave` only once, and inventory submission is always run first. So by the time it got to the deployment it was always returning success. But under load the inventory submission could fail, which would run `Sync_Error` instead and skip `Sync_Leave`, leaving the programmed failure around for the upcoming deployment, where the bug would then occur.

Testing this in unit tests requires supporting more than one deployment run, which requires an extra member in the exit state. In the spirit of limiting space requirements for embedded systems, I've made this member, which is only used for tests, debug-only.

Changelog: Fix crash when `Sync_Leave` returns error during a deployment. The error message would be:

```
State machine event DeploymentStarted was not handled by any transition
```

and would happen on the next deployment following the `Sync_Leave` error. With a long polling interval, this could cause the bug to remain latent for quite a while.

Ticket: MEN-7379

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
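As a rough illustration of the transition change described above, here is a simplified sketch with made-up state names, not the actual mender-update state machine: the key point is that any state entered after a deployment has started must keep a path to the end-of-deployment cleanup, which a generic error state does not have.

```cpp
// Minimal sketch of the idea behind the fix (names are simplified, not the
// real mender-update code): an error state reached after the deployment has
// started must still route through EndOfDeployment, otherwise the next
// DeploymentStarted event finds no matching transition.
#include <iostream>

enum class State {
	SyncLeaveDownload,
	Download,
	SyncError,          // generic error state, knows nothing about deployments
	SyncErrorDownload,  // deployment-aware error state (the dedicated fix)
	EndOfDeployment,    // mandatory cleanup once a deployment has started
	Idle,
};

// Pick the next state after running the Sync_Leave script for a download.
State NextAfterSyncLeaveDownload(bool script_succeeded) {
	if (script_succeeded) {
		return State::Download;  // normal path, always ends in EndOfDeployment
	}
	// Before the fix this was the equivalent of State::SyncError, which skips
	// EndOfDeployment and leaves the started deployment without cleanup.
	return State::SyncErrorDownload;
}

int main() {
	// Follow the error path and check that it still ends the deployment.
	State state = NextAfterSyncLeaveDownload(false);
	while (state != State::Idle) {
		switch (state) {
		case State::Download:
		case State::SyncErrorDownload:
			state = State::EndOfDeployment;
			break;
		case State::EndOfDeployment:
			std::cout << "deployment ended, cleanup done\n";
			state = State::Idle;
			break;
		default:
			state = State::Idle;
			break;
		}
	}
	return 0;
}
```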
Although somewhat rare, it seems to have more than a 50% chance of happening at least once in one of the state script tests of an integration test run.

What's happening is that when running `systemctl restart mender-updated` from an `ArtifactReboot` script, systemd kills the whole control group, including the script. This is fine in itself, but if the script happens to terminate before Mender does, then it will be recorded as an error, and Mender will start on its error path. What happens afterwards depends on how far it gets before it is also killed. Usually it will not get further than executing the first `ArtifactReboot_Error` script, but it could potentially go all the way to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses `NeedsArtifactReboot=Automatic`, which doesn't call the update module's `ArtifactReboot`, but it can affect other means of running `ArtifactReboot`, such as restarting the client with systemctl after a package upgrade.

The best way to mitigate this is to make sure the script survives longer than Mender. This can be done in the script itself with a shell `trap` or similar, since systemd sends SIGTERM first. But in order to make this less surprising for users, switch systemd to kill the client first in all cases, leaving scripts to be killed only if the termination times out and systemd has to resort to SIGKILL.

This started appearing with Yocto scarthgap, and why it has appeared now is anyone's guess; there could be multiple reasons:

* The exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group slightly differently.

Whatever the reason, it causes the script to sometimes terminate before Mender, causing the issue.

Changelog: Fix systemd race condition when restarting mender from `ArtifactReboot` script. The symptom would be an error message like:

```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```

and the `ArtifactReboot_Error` state scripts would be executed, even though they should not be.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
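For reference, the behaviour described above, SIGTERM to the client first and state scripts only killed if stopping times out and systemd falls back to SIGKILL, matches systemd's `KillMode=mixed`. The snippet below is only an assumed illustration of how that could be expressed as a drop-in; the unit file change actually made in this commit may look different, and the drop-in path is hypothetical.

```ini
# Assumed illustration, not necessarily the exact change made in this commit.
# Hypothetical drop-in path: /etc/systemd/system/mender-updated.service.d/10-killmode.conf
[Service]
# "mixed" sends SIGTERM only to the main process (the Mender client); the
# rest of the control group (e.g. a running ArtifactReboot state script) is
# only signalled when the stop times out and systemd falls back to SIGKILL.
KillMode=mixed
```

The effect is the one the commit message describes: the client is terminated first, so the state script's own termination is no longer observed and recorded as a script error.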
Merging these commits will result in the following changelog entries:

Changelogs

mender (sync_leave_crash)

New changes in mender since master:

Bug Fixes
One new fix. This one is unrelated, but started showing up independently once I removed flaky.
🤯
Only one known spurious failure. Merging.

Hello 😺 This PR contains changelog entries. Please verify the need of backporting it to the following release branches: