
fix: Fix crash when Sync_Leave returns error during a deployment. #1653

Merged — 3 commits merged into mendersoftware:master from sync_leave_crash on Aug 12, 2024

Conversation

@kacf (Member) commented Aug 6, 2024

The problem is that once we have entered the `sync_leave_download`
internal state, a deployment has started, and a deployment always has
to be ended, to properly clean up. This is fine when
`sync_leave_download` is successful and proceeds into subsequent
download states, which always end with the
`end_of_deployment_state`. But if `sync_leave_download` fails, then it
goes directly to `sync_error`, which does not know about the
deployment, and does not go to `end_of_deployment_state`.

Fix this by creating a dedicated `sync_error_download` state.
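To make the shape of the fix concrete, here is a minimal, hypothetical transition-table sketch. Only the state names are taken from this PR; the types, the table, and the `Step` helper are illustrative stand-ins, not the Mender client's actual implementation:

```cpp
#include <cassert>
#include <map>
#include <utility>

// Illustrative states and events; names mirror the description above,
// everything else is a simplified stand-in for the real state machine.
enum class State {
	Idle,
	SyncLeaveDownload,
	SyncError,          // generic error state: knows nothing about deployments
	SyncErrorDownload,  // the fix: an error state that still ends the deployment
	EndOfDeployment,
};
enum class Event { DeploymentStarted, SyncLeaveOk, SyncLeaveFailed, Done };

using Key = std::pair<State, Event>;

// Before the fix, SyncLeaveFailed went straight to SyncError, from which
// no transition reaches EndOfDeployment, so the deployment that had
// already started was never cleaned up.
static const std::map<Key, State> transitions {
	{{State::Idle, Event::DeploymentStarted}, State::SyncLeaveDownload},
	{{State::SyncLeaveDownload, Event::SyncLeaveOk}, State::EndOfDeployment},
	{{State::SyncLeaveDownload, Event::SyncLeaveFailed}, State::SyncErrorDownload},
	{{State::SyncErrorDownload, Event::Done}, State::EndOfDeployment},
};

State Step(State state, Event event) {
	auto it = transitions.find({state, event});
	// This is the failure mode the changelog describes: an event that no
	// transition handles.
	assert(it != transitions.end() && "event was not handled by any transition");
	return it->second;
}
```

With the dedicated error state in the table, the failing path still reaches the end-of-deployment state, so the next `DeploymentStarted` event finds the machine in a state that handles it.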

This bug happened only occasionally in the test
`test_state_scripts[Corrupted_script_version_in_etc-test_set11]` in
the integration repository, typically under load. The reason it
happened only occasionally is that the test is programmed to fail in
`Sync_Leave` only once, and inventory submission is always run
first. So by the time it got to the deployment it was always returning
success. But under load it could happen that the inventory submission
failed, which would run `Sync_Error` instead and skip `Sync_Leave`,
leaving it around for the upcoming deployment, where the bug would
then occur.

Testing this in unit tests requires supporting more than one
deployment run, which requires an extra member in the exit state. In
the spirit of trying to limit space requirements for embedded systems,
I've made this member, which is only used for tests, debug-only.
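A minimal sketch of what such a debug-only, test-facing member can look like. The class and member names here are invented for illustration and are not taken from the PR; the point is that release builds define `NDEBUG`, so the extra field costs nothing on embedded targets:

```cpp
#include <cassert>

// Hypothetical exit-state holder with a counter that unit tests can
// inspect; the member is compiled out of release (NDEBUG) builds.
class ExitState {
public:
	void RecordDeploymentRun() {
#ifndef NDEBUG
		++deployment_runs_;
#endif
	}

#ifndef NDEBUG
	// Debug-only: lets tests verify how many deployment runs completed.
	int deployment_runs_ {0};
#endif
};
```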

Changelog: Fix crash when `Sync_Leave` returns error during a
deployment. The error message would be:
```
State machine event DeploymentStarted was not handled by any transition
```
and would happen on the next deployment following the `Sync_Leave`
error. With a long polling interval, this could cause the bug to be
latent for quite a while.

Ticket: MEN-7379

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>

@mender-test-bot

@kacf, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".


my commands and options

You can trigger a pipeline on multiple prs with:

  • mentioning me and start pipeline --pr mender/127 --pr mender-connect/255

You can start a fast pipeline, disabling full integration tests with:

  • mentioning me and start pipeline --fast

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can cherry pick to a given branch or branches with:

  • mentioning me and:
    ```
    cherry-pick to:
    * 1.0.x
    * 2.0.x
    ```

@kacf (Member, Author)

kacf commented Aug 6, 2024

Being tested together with mendersoftware/integration#2640.

@lluiscampos (Contributor)

@mender-test-bot start pipeline --pr integration/2640

@mender-test-bot

Hello 😺 I created a pipeline for you here: Pipeline-1403668934

Build Configuration Matrix

Key Value
AUDITLOGS_REV master
BUILD_BEAGLEBONEBLACK true
BUILD_CLIENT true
BUILD_QEMUX86_64_BIOS_GRUB true
BUILD_QEMUX86_64_BIOS_GRUB_GPT true
BUILD_QEMUX86_64_UEFI_GRUB true
BUILD_VEXPRESS_QEMU true
BUILD_VEXPRESS_QEMU_FLASH true
BUILD_VEXPRESS_QEMU_UBOOT_UEFI_GRUB true
CREATE_ARTIFACT_WORKER_REV master
DEPLOYMENTS_ENTERPRISE_REV master
DEPLOYMENTS_REV master
DEVICEAUTH_ENTERPRISE_REV master
DEVICEAUTH_REV master
DEVICECONFIG_REV master
DEVICECONNECT_REV master
DEVICEMONITOR_REV master
GENERATE_DELTA_WORKER_REV master
GUI_REV master
INTEGRATION_REV pull/2640/head
INVENTORY_ENTERPRISE_REV master
INVENTORY_REV master
IOT_MANAGER_REV master
MENDER_ARTIFACT_REV master
MENDER_BINARY_DELTA_REV master
MENDER_CLI_REV master
MENDER_CONFIGURE_MODULE_REV master
MENDER_CONNECT_REV master
MENDER_CONVERT_REV master
MENDER_GATEWAY_REV master
MENDER_REV pull/1653/head
MENDER_SETUP_REV master
MENDER_SNAPSHOT_REV master
MONITOR_CLIENT_REV master
RUN_BACKEND_INTEGRATION_TESTS true
RUN_INTEGRATION_TESTS true
TENANTADM_REV master
TEST_QEMUX86_64_BIOS_GRUB true
TEST_QEMUX86_64_BIOS_GRUB_GPT true
TEST_QEMUX86_64_UEFI_GRUB true
TEST_VEXPRESS_QEMU true
TEST_VEXPRESS_QEMU_FLASH true
TEST_VEXPRESS_QEMU_UBOOT_UEFI_GRUB true
USERADM_ENTERPRISE_REV master
USERADM_REV master
WORKFLOWS_ENTERPRISE_REV master
WORKFLOWS_REV master

@lluiscampos (Contributor) left a comment


Splendid. Thanks.

This particular bug would have taken me a very long time to figure out.

Review comment on `src/mender-update/daemon/states.cpp` (resolved)
@lluiscampos (Contributor)

> Being tested together with mendersoftware/integration#2640.

I started the build again (using the bot) to run the pipeline with scarthgap. Your local cache is likely outdated 😅

@kacf (Member, Author)

kacf commented Aug 7, 2024

> > Being tested together with mendersoftware/integration#2640.
>
> I started the build again (using the bot) to run the pipeline with scarthgap. Your local cache is likely outdated 😅

Thanks! I had a "doh!" moment as soon as I saw your message! 😄

@kacf force-pushed the sync_leave_crash branch 2 times, most recently from db58e6b to 6a70ff1, on August 7, 2024 13:16
Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
Although somewhat rare, it seems to have more than 50% chance of
happening at least once in one of the state script tests of an
integration test run. What's happening is that when running `systemctl
restart mender-updated` from an `ArtifactReboot` script, systemd kills
the whole control group, including the script. This is fine in itself,
but if the script happens to terminate before Mender does, then it
will be recorded as an error, and Mender will start on its error
path. What happens afterwards depends on how far it gets before it is
also killed. Usually it will not get further than executing the first
`ArtifactReboot_Error` script, but it could potentially go all the way
to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses
`NeedsArtifactReboot=Automatic`, which doesn't call the update
module's `ArtifactReboot`, but it can affect other means of running
`ArtifactReboot`, such as restarting it with systemctl after a package
upgrade.

The best way to mitigate this is to make sure the script survives
longer than Mender. This can be done in the script itself with a shell
`trap` or similar, since systemd sends SIGTERM first. But in order to
make this less surprising for users, switch systemd to kill the client
first in all cases, leaving scripts to be killed only if the
termination times out and it has to resort to SIGKILL.
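In systemd unit terms, the described behavior corresponds to `KillMode=mixed` (an assumption on my part; the PR's actual unit-file change may differ): SIGTERM goes only to the unit's main process, and the rest of the control group, including a still-running state script, is SIGKILLed only if stopping times out. A sketch of such a unit fragment:

```ini
# Hypothetical fragment of mender-updated.service illustrating the idea;
# the ExecStart path is a placeholder, not taken from this PR.
[Service]
ExecStart=/usr/bin/mender-update daemon
# mixed: SIGTERM to the main process (the Mender client) only; remaining
# processes in the control group (e.g. state scripts) get SIGKILL only
# after TimeoutStopSec expires.
KillMode=mixed
```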

This started appearing with Yocto scarthgap; why it has appeared now
is anyone's guess, and there could be multiple reasons:
* Exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group
  slightly differently.
Whatever the reason, it causes the script to sometimes terminate
before Mender, causing the issue.

Changelog: Fix systemd race condition when restarting mender from
`ArtifactReboot` script. The symptom would be an error message like:
```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```
And the `ArtifactReboot_Error` state scripts would be executed, even
though they should not.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
@mender-test-bot

mender-test-bot commented Aug 9, 2024

Merging these commits will result in the following changelog entries:

Changelogs

mender (sync_leave_crash)

New changes in mender since master:

Bug Fixes
  • Fix crash when `Sync_Leave` returns error during a
    deployment. The error message would be:
    ```
    State machine event DeploymentStarted was not handled by any transition
    ```
    and would happen on the next deployment following the `Sync_Leave`
    error. With a long polling interval, this could cause the bug to be
    latent for quite a while.
    (MEN-7379)
  • Fix systemd race condition when restarting mender from
    `ArtifactReboot` script. The symptom would be an error message like:
    ```
    Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
    ```
    And the `ArtifactReboot_Error` state scripts would be executed, even
    though they should not.

@kacf (Member, Author)

kacf commented Aug 9, 2024

One new fix. This one is unrelated, but started showing up independently once I removed `flaky`.

@kacf kacf requested a review from lluiscampos August 9, 2024 11:40
@lluiscampos (Contributor) left a comment


🤯

@kacf (Member, Author)

kacf commented Aug 12, 2024

Only one known spurious failure. Merging.

@kacf kacf merged commit eae7540 into mendersoftware:master Aug 12, 2024
17 of 19 checks passed
@kacf kacf deleted the sync_leave_crash branch August 12, 2024 05:44
@mender-test-bot

Hello 😺 This PR contains changelog entries. Please verify the need of backporting it to the following release branches:
4.0.x (release 3.7.x)
3.5.x (release 3.6.x)
