
fix: Fix crash when Sync_Leave returns error during a deployment. #1653

Merged — 3 commits merged into mendersoftware:master from sync_leave_crash on Aug 12, 2024

Conversation

@kacf (Member) commented Aug 6, 2024

The problem is that once we have entered the `sync_leave_download`
internal state, a deployment has started, and a deployment always has
to be ended, to properly clean up. This is fine when
`sync_leave_download` is successful and proceeds into subsequent
download states, which always end with the
`end_of_deployment_state`. But if `sync_leave_download` fails, then it
goes directly to `sync_error`, which does not know about the
deployment, and does not go to `end_of_deployment_state`.

Fix this by creating a dedicated `sync_error_download` state.
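To make the shape of the fix concrete, here is a minimal, hypothetical transition-table sketch. Only the state names are taken from this PR; the types, the table, and the `Step` helper are illustrative stand-ins, not the Mender client's actual implementation:

```cpp
#include <cassert>
#include <map>
#include <utility>

// Illustrative states and events; names mirror the description above,
// everything else is a simplified stand-in for the real state machine.
enum class State {
	Idle,
	SyncLeaveDownload,
	SyncError,          // generic error state: knows nothing about deployments
	SyncErrorDownload,  // the fix: an error state that still ends the deployment
	EndOfDeployment,
};
enum class Event { DeploymentStarted, SyncLeaveOk, SyncLeaveFailed, Done };

using Key = std::pair<State, Event>;

// Before the fix, SyncLeaveFailed went straight to SyncError, from which
// no transition reaches EndOfDeployment, so the deployment that had
// already started was never cleaned up.
static const std::map<Key, State> transitions {
	{{State::Idle, Event::DeploymentStarted}, State::SyncLeaveDownload},
	{{State::SyncLeaveDownload, Event::SyncLeaveOk}, State::EndOfDeployment},
	{{State::SyncLeaveDownload, Event::SyncLeaveFailed}, State::SyncErrorDownload},
	{{State::SyncErrorDownload, Event::Done}, State::EndOfDeployment},
};

State Step(State state, Event event) {
	auto it = transitions.find({state, event});
	// This is the failure mode the changelog describes: an event that no
	// transition handles.
	assert(it != transitions.end() && "event was not handled by any transition");
	return it->second;
}
```

With the dedicated error state in the table, the failing path still reaches the end-of-deployment state, so the next `DeploymentStarted` event finds the machine in a state that handles it.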

This bug happened only occasionally in the test
`test_state_scripts[Corrupted_script_version_in_etc-test_set11]` in
the integration repository, typically under load. The reason it
happened only occasionally is that the test is programmed to fail in
`Sync_Leave` only once, and inventory submission is always run
first. So by the time it got to the deployment it was always returning
success. But under load it could happen that the inventory submission
failed, which would run `Sync_Error` instead and skip `Sync_Leave`,
leaving it around for the upcoming deployment, where the bug would
then occur.

Testing this in unit tests requires supporting more than one
deployment run, which requires an extra member in the exit state. In
the spirit of trying to limit space requirements for embedded systems,
I've made this member, which is only used for tests, debug-only.
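A minimal sketch of what such a debug-only, test-facing member can look like. The class and member names here are invented for illustration and are not taken from the PR; the point is that release builds define `NDEBUG`, so the extra field costs nothing on embedded targets:

```cpp
#include <cassert>

// Hypothetical exit-state holder with a counter that unit tests can
// inspect; the member is compiled out of release (NDEBUG) builds.
class ExitState {
public:
	void RecordDeploymentRun() {
#ifndef NDEBUG
		++deployment_runs_;
#endif
	}

#ifndef NDEBUG
	// Debug-only: lets tests verify how many deployment runs completed.
	int deployment_runs_ {0};
#endif
};
```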

Changelog: Fix crash when `Sync_Leave` returns error during a
deployment. The error message would be:
```
State machine event DeploymentStarted was not handled by any transition
```
and would happen on the next deployment following the `Sync_Leave`
error. With a long polling interval, this could cause the bug to be
latent for quite a while.

Ticket: MEN-7379

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>

@mender-test-bot

@kacf, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".


my commands and options

You can trigger a pipeline on multiple prs with:

  • mentioning me and start pipeline --pr mender/127 --pr mender-connect/255

You can start a fast pipeline, disabling full integration tests with:

  • mentioning me and start pipeline --fast

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can cherry pick to a given branch or branches with:

  • mentioning me and:
    ```
    cherry-pick to:
    * 1.0.x
    * 2.0.x
    ```

@kacf (Member, Author)

kacf commented Aug 6, 2024

Being tested together with mendersoftware/integration#2640.

@lluiscampos (Contributor)

@mender-test-bot start pipeline --pr integration/2640

@mender-test-bot

Hello 😺 I created a pipeline for you here: Pipeline-1403668934

Build Configuration Matrix

Key Value
AUDITLOGS_REV master
BUILD_BEAGLEBONEBLACK true
BUILD_CLIENT true
BUILD_QEMUX86_64_BIOS_GRUB true
BUILD_QEMUX86_64_BIOS_GRUB_GPT true
BUILD_QEMUX86_64_UEFI_GRUB true
BUILD_VEXPRESS_QEMU true
BUILD_VEXPRESS_QEMU_FLASH true
BUILD_VEXPRESS_QEMU_UBOOT_UEFI_GRUB true
CREATE_ARTIFACT_WORKER_REV master
DEPLOYMENTS_ENTERPRISE_REV master
DEPLOYMENTS_REV master
DEVICEAUTH_ENTERPRISE_REV master
DEVICEAUTH_REV master
DEVICECONFIG_REV master
DEVICECONNECT_REV master
DEVICEMONITOR_REV master
GENERATE_DELTA_WORKER_REV master
GUI_REV master
INTEGRATION_REV pull/2640/head
INVENTORY_ENTERPRISE_REV master
INVENTORY_REV master
IOT_MANAGER_REV master
MENDER_ARTIFACT_REV master
MENDER_BINARY_DELTA_REV master
MENDER_CLI_REV master
MENDER_CONFIGURE_MODULE_REV master
MENDER_CONNECT_REV master
MENDER_CONVERT_REV master
MENDER_GATEWAY_REV master
MENDER_REV pull/1653/head
MENDER_SETUP_REV master
MENDER_SNAPSHOT_REV master
MONITOR_CLIENT_REV master
RUN_BACKEND_INTEGRATION_TESTS true
RUN_INTEGRATION_TESTS true
TENANTADM_REV master
TEST_QEMUX86_64_BIOS_GRUB true
TEST_QEMUX86_64_BIOS_GRUB_GPT true
TEST_QEMUX86_64_UEFI_GRUB true
TEST_VEXPRESS_QEMU true
TEST_VEXPRESS_QEMU_FLASH true
TEST_VEXPRESS_QEMU_UBOOT_UEFI_GRUB true
USERADM_ENTERPRISE_REV master
USERADM_REV master
WORKFLOWS_ENTERPRISE_REV master
WORKFLOWS_REV master

@lluiscampos (Contributor) left a comment


Splendid. Thanks.

This particular bug would have taken me a very long time to figure out.

Review comment on `src/mender-update/daemon/states.cpp` (resolved)
@lluiscampos (Contributor)

> Being tested together with mendersoftware/integration#2640.

I started the build again (using the bot) to run the pipeline with scarthgap. Your local cache is likely outdated 😅

@kacf (Member, Author)

kacf commented Aug 7, 2024

> > Being tested together with mendersoftware/integration#2640.
>
> I started the build again (using the bot) to run the pipeline with scarthgap. Your local cache is likely outdated 😅

Thanks! I had a "doh!" moment as soon as I saw your message! 😄

@kacf force-pushed the sync_leave_crash branch 2 times, most recently from db58e6b to 6a70ff1, on August 7, 2024 13:16
Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
Although somewhat rare, it seems to have more than 50% chance of
happening at least once in one of the state script tests of an
integration test run. What's happening is that when running `systemctl
restart mender-updated` from an `ArtifactReboot` script, systemd kills
the whole control group, including the script. This is fine in itself,
but if the script happens to terminate before Mender does, then it
will be recorded as an error, and Mender will start on its error
path. What happens afterwards depends on how far it gets before it is
also killed. Usually it will not get further than executing the first
`ArtifactReboot_Error` script, but it could potentially go all the way
to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses
`NeedsArtifactReboot=Automatic`, which doesn't call the update
module's `ArtifactReboot`, but it can affect other means of running
`ArtifactReboot`, such as restarting it with systemctl after a package
upgrade.

The best way to mitigate this is to make sure the script survives
longer than Mender. This can be done in the script itself with a shell
`trap` or similar, since systemd sends SIGTERM first. But in order to
make this less surprising for users, switch systemd to kill the client
first in all cases, leaving scripts to be killed only if the
termination times out and it has to resort to SIGKILL.
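In systemd unit terms, the described behavior corresponds to `KillMode=mixed` (an assumption on my part; the PR's actual unit-file change may differ): SIGTERM goes only to the unit's main process, and the rest of the control group, including a still-running state script, is SIGKILLed only if stopping times out. A sketch of such a unit fragment:

```ini
# Hypothetical fragment of mender-updated.service illustrating the idea;
# the ExecStart path is a placeholder, not taken from this PR.
[Service]
ExecStart=/usr/bin/mender-update daemon
# mixed: SIGTERM to the main process (the Mender client) only; remaining
# processes in the control group (e.g. state scripts) get SIGKILL only
# after TimeoutStopSec expires.
KillMode=mixed
```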

This started appearing with Yocto scarthgap; why it has appeared now
is anyone's guess, and there could be multiple reasons:
* Exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group
  slightly differently.
Whatever the reason, it causes the script to sometimes terminate
before Mender, causing the issue.

Changelog: Fix systemd race condition when restarting mender from
`ArtifactReboot` script. The symptom would be an error message like:
```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```
And the `ArtifactReboot_Error` state scripts would be executed, even
though they should not.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
@mender-test-bot

mender-test-bot commented Aug 9, 2024

Merging these commits will result in the following changelog entries:

Changelogs

mender (sync_leave_crash)

New changes in mender since master:

Bug Fixes
  • Fix crash when `Sync_Leave` returns error during a
    deployment. The error message would be:
    ```
    State machine event DeploymentStarted was not handled by any transition
    ```
    and would happen on the next deployment following the `Sync_Leave`
    error. With a long polling interval, this could cause the bug to be
    latent for quite a while.
    (MEN-7379)
  • Fix systemd race condition when restarting mender from
    `ArtifactReboot` script. The symptom would be an error message like:
    ```
    Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
    ```
    And the `ArtifactReboot_Error` state scripts would be executed, even
    though they should not.

@kacf (Member, Author)

kacf commented Aug 9, 2024

One new fix. This one is unrelated, but started showing up independently once I removed `flaky`.

@kacf kacf requested a review from lluiscampos August 9, 2024 11:40
@lluiscampos (Contributor) left a comment


🤯

@kacf (Member, Author)

kacf commented Aug 12, 2024

Only one known spurious failure. Merging.

@kacf kacf merged commit eae7540 into mendersoftware:master Aug 12, 2024
17 of 19 checks passed
@kacf kacf deleted the sync_leave_crash branch August 12, 2024 05:44
@mender-test-bot

Hello 😺 This PR contains changelog entries. Please verify the need of backporting it to the following release branches:
4.0.x (release 3.7.x)
3.5.x (release 3.6.x)
