
4.0.x cherry-picks #1655

Merged: 4 commits into 4.0.x on Aug 12, 2024

Conversation

@kacf (Member) commented Aug 12, 2024

No description provided.

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 660eeb8)
The problem is that once we have entered the `sync_leave_download`
internal state, a deployment has started, and a deployment, once
started, always has to be ended in order to clean up properly. This
is fine when `sync_leave_download` succeeds and proceeds into the
subsequent download states, which always end with
`end_of_deployment_state`. But if `sync_leave_download` fails, it
goes directly to `sync_error`, which does not know about the
deployment and does not go to `end_of_deployment_state`.

Fix this by creating a dedicated `sync_error_download` state.
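The flow described above can be sketched as a tiny state machine. This
is an illustrative sketch only, not the client's actual implementation;
the state names mirror the ones in the text, and `FAIL_SYNC_LEAVE` is a
made-up knob to simulate the failing `Sync_Leave` script:

```shell
#!/bin/sh
# Sketch of the fixed state flow: once sync_leave_download runs, a
# deployment has started, so every path out of it, including the error
# path, must reach end_of_deployment_state. Set FAIL_SYNC_LEAVE=1 to
# simulate the Sync_Leave failure.
state=sync_leave_download
while [ "$state" != done ]; do
    case "$state" in
        sync_leave_download)
            if [ "${FAIL_SYNC_LEAVE:-0}" = 1 ]; then
                # The new dedicated error state, instead of the generic
                # sync_error, which does not know about the deployment.
                state=sync_error_download
            else
                state=download
            fi
            ;;
        download)
            state=end_of_deployment_state
            ;;
        sync_error_download)
            echo "handling Sync_Leave error"
            state=end_of_deployment_state
            ;;
        end_of_deployment_state)
            echo "deployment ended"
            state=done
            ;;
    esac
done
```

With `FAIL_SYNC_LEAVE=1` the error path still reaches
`end_of_deployment_state`, which is the invariant the fix restores;
routing the failure to a generic `sync_error` instead would skip it.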

This bug happened only occasionally in the test
`test_state_scripts[Corrupted_script_version_in_etc-test_set11]` in
the integration repository, typically under load. The reason it
happened only occasionally is that the test is programmed to fail in
`Sync_Leave` only once, and inventory submission is always run
first. So by the time it got to the deployment it was always returning
success. But under load it could happen that the inventory submission
failed, which would run `Sync_Error` instead and skip `Sync_Leave`,
leaving it around for the upcoming deployment, where the bug would
then occur.

Testing this in unit tests requires supporting more than one
deployment run, which requires an extra member in the exit state. In
the spirit of limiting space requirements for embedded systems, I've
made this member, which is only used for tests, debug-only.

Changelog: Fix crash when `Sync_Leave` returns error during a
deployment. The error message would be:
```
State machine event DeploymentStarted was not handled by any transition
```
and would happen on the next deployment following the `Sync_Leave`
error. With a long polling interval, this could cause the bug to be
latent for quite a while.

Ticket: MEN-7379

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 949a7a9)
Although somewhat rare, it seems to have more than a 50% chance of
happening at least once in one of the state script tests of an
integration test run. What is happening is that when `systemctl
restart mender-updated` is run from an `ArtifactReboot` script,
systemd kills the whole control group, including the script. This is
fine in itself, but if the script happens to terminate before Mender
does, then it will be recorded as an error, and Mender will start
down its error path. What happens afterwards depends on how far it
gets before it is also killed. Usually it will not get further than
executing the first `ArtifactReboot_Error` script, but it could
potentially go all the way to a rollback. Either of those is wrong.

The issue won't affect users of `rootfs-image`, since it uses
`NeedsArtifactReboot=Automatic`, which doesn't call the update
module's `ArtifactReboot`, but it can affect other means of running
`ArtifactReboot`, such as restarting it with systemctl after a package
upgrade.

The best way to mitigate this is to make sure the script survives
longer than Mender. This can be done in the script itself with a shell
`trap` or similar, since systemd sends SIGTERM first. But to make
this less surprising for users, switch systemd to kill the client
first in all cases, so that scripts are killed only if the
termination times out and systemd has to resort to SIGKILL.
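The `trap`-based mitigation mentioned above can be sketched as follows.
The script body is hypothetical; only the `trap` line is the mechanism
in question, and the `kill` here merely simulates systemd's signal:

```shell
#!/bin/sh
# Sketch of a state script that outlives the client when systemd stops
# the whole control group: systemd sends SIGTERM first, and a script
# that ignores SIGTERM keeps running until its work is done (it would
# only die to a later SIGKILL if the stop operation times out).
trap '' TERM

kill -TERM $$    # simulate systemd's initial SIGTERM to the group
echo "script survived SIGTERM"
```

On the systemd side, the behavior the commit describes (terminate the
main process first, fall back to SIGKILL for remaining processes only
on timeout) matches the semantics of systemd's `KillMode=mixed`,
though the actual unit-file change is not shown on this page.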

This started appearing with Yocto scarthgap. Why it has appeared now
is anyone's guess; there could be multiple reasons:
* Exact killing pattern of systemd might have changed slightly.
* The new kernel might kill processes in the same control group
  slightly differently.
Whatever the reason, the script now sometimes terminates before
Mender, triggering the issue.

Changelog: Fix systemd race condition when restarting mender from
`ArtifactReboot` script. The symptom would be an error message like:
```
Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
```
And the `ArtifactReboot_Error` state scripts would be executed, even
though they should not be.

Ticket: None

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 5e553cf)
@mender-test-bot

@kacf, Let me know if you want to start the integration pipeline by mentioning me and the command "start pipeline".


my commands and options

You can trigger a pipeline on multiple prs with:

  • mentioning me and start pipeline --pr mender/127 --pr mender-connect/255

You can start a fast pipeline, disabling full integration tests with:

  • mentioning me and start pipeline --fast

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can cherry pick to a given branch or branches with:

  • mentioning me and:
    cherry-pick to:
    * 1.0.x
    * 2.0.x

Signed-off-by: Kristian Amlie <kristian.amlie@northern.tech>
(cherry picked from commit 06d08c0)
@mender-test-bot

mender-test-bot commented Aug 12, 2024

Merging these commits will result in the following changelog entries:

Changelogs

mender (4.0.x)

New changes in mender since 4.0.x:

Bug Fixes
  • Fix crash when Sync_Leave returns error during a
    deployment. The error message would be:
    State machine event DeploymentStarted was not handled by any transition
    
    and would happen on the next deployment following the Sync_Leave
    error. With a long polling interval, this could cause the bug to be
    latent for quite a while.
    (MEN-7379)
  • Fix systemd race condition when restarting mender from
    ArtifactReboot script. The symptom would be an error message like:
    Process returned non-zero exit status: ArtifactReboot: Process exited with status 15
    
    And the ArtifactReboot_Error state scripts would be executed, even
    though they should not.

@kacf (Member, Author) commented Aug 12, 2024

Spurious failure. Merging.

@kacf kacf merged commit 017a9d6 into mendersoftware:4.0.x Aug 12, 2024
17 of 19 checks passed