
4.0.x cherry-picks #1655

Merged: 4 commits into 4.0.x on Aug 12, 2024

Conversation

@kacf (Member) commented Aug 12, 2024

No description provided.

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 660eeb8)
The problem is that once we have entered the `sync_leave_download`
internal state, a deployment has started, and a deployment, once
started, always has to be ended in order to clean up properly. This
is fine when `sync_leave_download` succeeds and proceeds into the
subsequent download states, which always end with
`end_of_deployment_state`. But if `sync_leave_download` fails, it
goes directly to `sync_error`, which does not know about the
deployment and does not go to `end_of_deployment_state`.

Fix this by creating a dedicated `sync_error_download` state.
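The flow described above can be sketched as a tiny state machine. This
is an illustrative sketch only, not the client's actual implementation;
the state names mirror the ones in the text, and `FAIL_SYNC_LEAVE` is a
made-up knob to simulate the failing `Sync_Leave` script:

```shell
#!/bin/sh
# Sketch of the fixed state flow: once sync_leave_download runs, a
# deployment has started, so every path out of it, including the error
# path, must reach end_of_deployment_state. Set FAIL_SYNC_LEAVE=1 to
# simulate the Sync_Leave failure.
state=sync_leave_download
while [ "$state" != done ]; do
    case "$state" in
        sync_leave_download)
            if [ "${FAIL_SYNC_LEAVE:-0}" = 1 ]; then
                # The new dedicated error state, instead of the generic
                # sync_error, which does not know about the deployment.
                state=sync_error_download
            else
                state=download
            fi
            ;;
        download)
            state=end_of_deployment_state
            ;;
        sync_error_download)
            echo "handling Sync_Leave error"
            state=end_of_deployment_state
            ;;
        end_of_deployment_state)
            echo "deployment ended"
            state=done
            ;;
    esac
done
```

With `FAIL_SYNC_LEAVE=1` the error path still reaches
`end_of_deployment_state`, which is the invariant the fix restores;
routing the failure to a generic `sync_error` instead would skip it.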

This bug happened only occasionally in the test
`test_state_scripts[Corrupted_script_version_in_etc-test_set11]` in
the integration repository, typically under load. The reason it
happened only occasionally is that the test is programmed to fail in
`Sync_Leave` only once, and inventory submission is always run
first. So by the time it got to the deployment it was always returning
success. But under load it could happen that the inventory submission
failed, which would run `Sync_Error` instead and skip `Sync_Leave`,
leaving it around for the upcoming deployment, where the bug would
then occur.

Testing this in unit tests requires supporting more than one
deployment run, which requires an extra member in the exit state. In
the spirit of limiting space requirements for embedded systems, I've
made this member, which is only used for tests, debug-only.

Changelog: Fix crash when `Sync_Leave` returns error during a
deployment. The error message would be:
```
State machine event DeploymentStarted was not handled by any transition
```
and would happen on the next deployment following the `Sync_Leave`
error. With a long polling interval, this could cause the bug to be
latent for quite a while.

Ticket: MEN-7379

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 949a7a9)
Although somewhat rare, it seems to have more than a 50% chance of
happening at least once in one of the state script tests of an
integration test run. What is happening is that when `systemctl
restart mender-updated` is run from an `ArtifactReboot` script,
systemd kills the whole control group, including the script. This is
fine in itself, but if the script happens to terminate before Mender
does, then it will be recorded as an error, and Mender will start
down its error path. What happens afterwards depends on how far it
gets before it is also killed. Usually it will not get further than
executing the first `ArtifactReboot_Error` script, but it could
potentially go all the way to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses
`NeedsArtifactReboot=Automatic`, which doesn't call the update
module's `ArtifactReboot`, but it can affect other means of running
`ArtifactReboot`, such as restarting it with systemctl after a package
upgrade.

The best way to mitigate this is to make sure the script survives
longer than Mender. This can be done in the script itself with a shell
`trap` or similar, since systemd sends SIGTERM first. But to make
this less surprising for users, switch systemd to kill the client
first in all cases, so that scripts are killed only if the
termination times out and systemd has to resort to SIGKILL.
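The `trap`-based mitigation mentioned above can be sketched as follows.
The script body is hypothetical; only the `trap` line is the mechanism
in question, and the `kill` here merely simulates systemd's signal:

```shell
#!/bin/sh
# Sketch of a state script that outlives the client when systemd stops
# the whole control group: systemd sends SIGTERM first, and a script
# that ignores SIGTERM keeps running until its work is done (it would
# only die to a later SIGKILL if the stop operation times out).
trap '' TERM

kill -TERM $$    # simulate systemd's initial SIGTERM to the group
echo "script survived SIGTERM"
```

On the systemd side, the behavior the commit describes (terminate the
main process first, fall back to SIGKILL for remaining processes only
on timeout) matches the semantics of systemd's `KillMode=mixed`,
though the actual unit-file change is not shown on this page.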

This started appearing with Yocto scarthgap. Why it has appeared now
is anyone's guess; there could be multiple reasons:
* Exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group
  slightly differently.
Whatever the reason, the script now sometimes terminates before
Mender, triggering the issue.

Changelog: Fix systemd race condition when restarting mender from
`ArtifactReboot` script. The symptom would be an error message like:
```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```
And the `ArtifactReboot_Error` state scripts would be executed, even
though they should not be.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 5e553cf)
@mender-test-bot

@kacf, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".


my commands and options

You can trigger a pipeline on multiple prs with:

  • mentioning me and start pipeline --pr mender/127 --pr mender-connect/255

You can start a fast pipeline, disabling full integration tests with:

  • mentioning me and start pipeline --fast

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can cherry pick to a given branch or branches with:

  • mentioning me and:
    cherry-pick to:
    * 1.0.x
    * 2.0.x

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 06d08c0)
@mender-test-bot

mender-test-bot commented Aug 12, 2024

Merging these commits will result in the following changelog entries:

Changelogs

mender (4.0.x)

New changes in mender since 4.0.x:

Bug Fixes
  • Fix crash when Sync_Leave returns error during a
    deployment. The error message would be:
    State machine event DeploymentStarted was not handled by any transition
    
    and would happen on the next deployment following the Sync_Leave
    error. With a long polling interval, this could cause the bug to be
    latent for quite a while.
    (MEN-7379)
  • Fix systemd race condition when restarting mender from
    ArtifactReboot script. The symptom would be an error message like:
    Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
    
    And the ArtifactReboot_Error state scripts would be executed, even
    though they should not.

@kacf (Member, Author) commented Aug 12, 2024

Spurious failure. Merging.

@kacf kacf merged commit 017a9d6 into mendersoftware:4.0.x Aug 12, 2024
17 of 19 checks passed