
Add systemd soft reboot functionality#4304

Closed
skycastlelily wants to merge 2 commits into main from systemctl-reboot

Conversation

@skycastlelily (Collaborator) commented Nov 10, 2025

Fixes: #4298

I guess we should support `systemctl soft-reboot` for all possible plugins?

It could help our users update all of userspace efficiently (when there are no kernel changes), which means a shorter reboot time.
And we should not make it the default:
"It definitely can't work to change all reboots to soft reboots as some use cases will want to change the kernel for example."

Pull Request Checklist

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • adjust plugin docstring
  • modify the json schema
  • mention the version
  • include a release note

@skycastlelily skycastlelily marked this pull request as draft November 10, 2025 15:16
@LecrisUT (Contributor)

Some more info on why systemd soft-reboot is needed would be useful, either in the comments or the PR description. I can't judge right now whether we should make that the default whenever systemd is present, a feature specific to systemd, or something else entirely.

@LecrisUT LecrisUT self-assigned this Nov 11, 2025
@github-project-automation github-project-automation bot moved this to backlog in planning Nov 11, 2025
@LecrisUT LecrisUT moved this from backlog to implement in planning Nov 11, 2025
@skycastlelily (Collaborator, Author)

Hi @cgwalters, I'm working on this MR. It works with the virtual plugin, but not the bootc plugin. When I manually ssh into the bootc system created by the bootc plugin and run "systemctl soft-reboot", I'm not able to log in to that system anymore. Do you have any idea why, what additional steps I should add, or any hints or guides? Thanks :)

@psss psss added the command | reboot Support for rebooting guests during `tmt run` and the `tmt-reboot` command label Nov 20, 2025
@happz (Contributor) commented Nov 20, 2025

One preliminary comment: "soft reboot" already has a history and its own meaning, which is by no means the same as "systemd soft-reboot". Please consider using a different name/label/variable (e.g. soft_reboot="$3") that would make it clear that this mode is not the "soft reboot" as already known and implemented, i.e. a software-induced reboot, e.g. via shutdown -r now or a similar command, paired with the "hard reboot" on the level of poweroff/poweron events. Great care must be taken to make it clear that "soft reboot" is one thing, and "systemd soft-reboot" is something else.

@cgwalters (Contributor)

Yes true, but OTOH I think it should be quite unusual and rare for code executed inside a guest to "reach out" to a hypervisor or control plane and do a physical reboot. That's the case that I think needs dedicated nomenclature.

I would probably rename `hard` to `physical` in tmt.

But yes, arguably systemd's soft reboot too should probably have been called an "init reboot" to be less ambiguous.

@cgwalters (Contributor)

Hi @cgwalters, I'm working on this MR. It works with the virtual plugin, but not the bootc plugin. When I manually ssh into the bootc system created by the bootc plugin and run "systemctl soft-reboot", I'm not able to log in to that system anymore. Do you have any idea why, what additional steps I should add, or any hints or guides? Thanks :)

Offhand it works for me (I happened to be testing in a `bcvk libvirt run quay.io/almalinuxorg/almalinux-bootc:10.0` machine), but if you can show your tmt reproducer setup we could look.

My recommendation here is to be sure you have a console set up at least for debugging.

@happz (Contributor) commented Nov 20, 2025

Yes true, but OTOH I think it should be quite unusual and rare for code executed inside a guest to "reach out" to a hypervisor or control plane and do a physical reboot.

Not necessarily the code running inside a guest, but for tmt this is a real situation. A guest may freeze or crash, e.g. thanks to various kernel-torturing tests or tests causing a kernel oops on purpose, and tmt does have tools to invoke a "hard" reboot to restore law and order. tmt does not care that much about how it's implemented, whether it's the magic of Beaker or libvirt & qemu, but it's called "hard reboot" in the tmt codebase.

I would probably rename the hard to physical in tmt.

That would be possible.

But yes arguably too systemd's soft reboots should probably have been called an init reboot to be less ambiguous.

I think sticking to "systemd soft-reboot" term should be enough, it just needs to be consistent. My point was to double check changes to avoid variables like soft_reboot which are not about "soft reboot", but "systemd soft-reboot" instead.

@skycastlelily (Collaborator, Author) commented Nov 24, 2025

but if you can show your tmt reproducer set up we could look.

Sure, here is the reproducer:

tmt run --skip finish --skip cleanup plan --name plans/bootc$

Content of the plan:

summary: Basic smoke test
provision:
    how: bootc
    container-image: quay.io/fedora/fedora-bootc:43
execute:
  script: echo 'test'

Offhand it works for me (I happened to be test in a bcvk libvirt run quay.io/almalinuxorg/almalinux-bootc:10.0 machine) but if you can show your tmt reproducer set up we could look.

With quay.io/almalinuxorg/almalinux-bootc:10.0 it does indeed work, though there is a failed service after systemctl soft-reboot:

 (dev) [lnie@ tmt]$ ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10029 -i /var/tmp/tmt/run-024/plans/bootc/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-024/ssh-sockets/127.0.0.1-10029-root.socket root@127.0.0.1 
Warning: Permanently added '[127.0.0.1]:10029' (ED25519) to the list of known hosts.
[root@default-0 ~]# systemctl soft-reboot
[root@default-0 ~]# Connection to 127.0.0.1 closed by remote host.
Connection to 127.0.0.1 closed.
(dev) [lnie@ tmt]$ ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10029 -i /var/tmp/tmt/run-024/plans/bootc/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-024/ssh-sockets/127.0.0.1-10029-root.socket root@127.0.0.1 
Warning: Permanently added '[127.0.0.1]:10029' (ED25519) to the list of known hosts.
Last login: Mon Nov 24 10:41:36 2025 from 10.0.2.2
[systemd]
Failed Units: 1
rpcbind.service
[root@default-0 ~]# systemctl status rpcbind.service
× rpcbind.service - RPC Bind
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; preset: enabled)
   Active: failed (Result: signal) since Mon 2025-11-24 10:41:56 UTC; 46s ago
 Duration: 2min 38.510s
Invocation: b7db4da5acc04012a32f3314ef7177f4
TriggeredBy: ● rpcbind.socket
     Docs: man:rpcbind(8)
  Process: 852 ExecStart=/usr/bin/rpcbind $RPCBIND_ARGS -w -f (code=killed, signal=KILL)
 Main PID: 852 (code=killed, signal=KILL)
 Mem peak: 2.4M
      CPU: 31ms

Nov 24 10:39:18 localhost systemd[1]: Starting rpcbind.service - RPC Bind...
Nov 24 10:39:18 localhost systemd[1]: Started rpcbind.service - RPC Bind.
[root@default-0 ~]# 

My recommendation here is to be sure you have a console set up at least for debugging.

here is part of the console output:

[  OK  ] Reached target local-fs-pre.target…Preparation for Local File Systems.         Starting systemd-udevd.service - R…ager for Device Events and Files...
[  OK  ] Started systemd-udevd.service - Ru…anager for Device Events and Files.
         Mounting boot.mount - /boot...
         Mounting var.mount - /var...
[  130.009798] XFS (vda3): Mounting V5 Filesystem 9a6f9939-ec4a-4c1f-a023-c0e51698b41c
[FAILED] Failed to mount var.mount - /var.
See 'systemctl status var.mount' for details.
[DEPEND] Dependency failed for cloud-init-m…rvice - Cloud-init: Single Process.
[DEPEND] Dependency failed for syste[  130.037685] XFS (vda3): Ending clean mount
md-homed.service - Home Area Manager.
[DEPEND] Dependency failed for systemd-psto…atform Persistent Storage Archival.
[DEPEND] Dependency failed for chronyd.service - NTP client/server.
[DEPEND] Dependency failed for raid-check.t…r - Weekly RAID setup health check.
[DEPEND] Dependency failed for fstrim.timer…used filesystem blocks once a week.
[DEPEND] Dependency failed for var-lib-nfs-…ipefs.mount - RPC Pipe File System.
[DEPEND] Dependency failed for rpc_pipefs.target.
[DEPEND] Dependency failed for rpc-gssd.ser… service for NFS client and server.
[DEPEND] Dependency failed for basic.target - Basic System.
[DEPEND] Dependency failed for multi-user.target - Multi-User System.
[DEPEND] Dependency failed for graphical.target - Graphical Interface.
[DEPEND] Dependency failed for systemd-logind.service - User Login Management.
[DEPEND] Dependency failed for systemd-upda…ecord System Boot/Shutdown in UTMP.
[DEPEND] Dependency failed for systemd-tpm2-setup.service - TPM SRK Setup.
[DEPEND] Dependency failed for systemd-rand…service - Load/Save OS Random Seed.
[DEPEND] Dependency failed for local-fs.target - Local File Systems.
[DEPEND] Dependency failed for selinux-auto…k the need to relabel after reboot.
[DEPEND] Dependency failed for systemd-jour…lush Journal to Persistent Storage.
[  OK  ] Mounted boot.mount - /boot.
[  OK  ] Stopped systemd-ask-password-conso…equests to Console Directory Watch.
[  OK  ] Stopped systemd-ask-password-wall.…d Requests to Wall Directory Watch.
[  OK  ] Reached target paths.target - Path Units.
[  OK  ] Reached target timers.target - Timer Units.
[  OK  ] Reached target ssh-access.target - SSH Access Available.
         Mounting boot-efi.mount - /boot/efi...
[  OK  ] Reached target cloud-init.target - Cloud-init target.
[  OK  ] Reached target nfs-client.target - NFS client services.
[  OK  ] Reached target remote-fs-pre.targe…reparation for Remote File Systems.
[  OK  ] Reached target remote-integrityset…Remote Integrity Protected Volumes.
[  OK  ] Reached target remote-veritysetup.… - Remote Verity Protected Volumes.
         Starting ostree-remount.service - OSTree Remount OS/ Bind Mounts...
[  OK  ] Reached target getty.target - Login Prompts.
         Starting cloud-init-local.service …-init: Local Stage (pre-network)...
[  OK  ] Reached target remote-cryptsetup.target - Remote Encrypted Volumes.
[  OK  ] Reached target remote-fs.target - Remote File Systems.
         Starting systemd-userdb-load-crede…r/group Records from Credentials...
[  OK  ] Reached target sockets.target - Socket Units.
[  130.211783] FAT-fs (vda2): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[  OK  ] Reached target bootc-status-update… - Bootc status trigger state sync.
[  OK  ] Started emergency.service - Emergency Shell.
[  OK  ] Reached target emergency.target - Emergency Mode.
         Starting systemd-binfmt.service - Set Up Additional Binary Formats...
[  OK  ] Mounted boot-efi.mount - /boot/efi.
[  OK  ] Finished ostree-remount.service - OSTree Remount OS/ Bind Mounts.
[  OK  ] Finished systemd-userdb-load-crede…ser/group Records from Credentials.
[  OK  ] Reached target nss-user-lookup.target - User and Group Name Lookups.
[  OK  ] Stopped target ssh-access.target - SSH Access Available.
         Mounting proc-sys-fs-binfmt_misc.m…cutable File Formats File System...
         Starting systemd-tmpfiles-setup.se…ate System Files and Directories...
[  130.008035] sh[1385]: nc: /run/cloud-init/share/local.sock: Connection refused
[  OK  ] Mounted proc-sys-fs-binfmt_misc.mo…xecutable File Formats File System.
[  OK  ] Finished cloud-init-local.service …ud-init: Local Stage (pre-network).
[  OK  ] Finished systemd-binfmt.service - Set Up Additional Binary Formats.
[  OK  ] Reached target cloud-config.target - Cloud-config availability.
[  OK  ] Reached target network-pre.target - Preparation for Network.
[  OK  ] Finished systemd-tmpfiles-setup.se…reate System Files and Directories.
         Starting systemd-oomd.service - Us…space Out-Of-Memory (OOM) Killer...
         Starting systemd-resolved.service - Network Name Resolution...
[  OK  ] Started systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer.
[  OK  ] Started systemd-resolved.service - Network Name Resolution.
[  OK  ] Reached target network.target - Network.
[  OK  ] Reached target network-online.target - Network is Online.
[  OK  ] Reached target nss-lookup.target - Host and Network Name Lookups.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, or "exit"
to continue bootup.
Enter root password for system maintenance
(or press Control-D to continue): 

Looks like the cause is the /var mount failure: "[FAILED] Failed to mount var.mount - /var."

bash-5.3# systemctl status var.mount
× var.mount - /var
     Loaded: loaded (/run/systemd/generator/var.mount; generated)
     Active: failed (Result: exit-code) since Mon 2025-11-24 07:19:15 UTC; 8min>
 Invocation: bfca267cc38e489d841f59ba05765fa3
      Where: /var
       What: /sysroot/ostree/deploy/default/var
       Docs: man:ostree(1)
   Mem peak: 1M
        CPU: 9ms

Nov 24 07:19:15 default-0 systemd[1]: Mounting var.mount - /var...
Nov 24 07:19:15 default-0 mount[1356]: mount: /var: special device /sysroot/ost>
Nov 24 07:19:15 default-0 mount[1356]:        dmesg(1) may have more informatio>
Nov 24 07:19:15 default-0 systemd[1]: var.mount: Mount process exited, code=exi>
Nov 24 07:19:15 default-0 systemd[1]: var.mount: Failed with result 'exit-code'.
Nov 24 07:19:15 default-0 systemd[1]: Failed to mount var.mount - /var.


dmesg: https://lnie.fedorapeople.org/dmesg.txt

Any idea how to avoid the failure?

FYI, the system is fine again after I `virsh destroy` and then `virsh start` it.

@skycastlelily skycastlelily added the ci | full test Pull request is ready for the full test execution label Nov 24, 2025
@skycastlelily skycastlelily marked this pull request as ready for review November 24, 2025 10:35
    )
    original_boot_id = boot_id_result.stdout.strip() if boot_id_result.stdout else ""
except Exception:
    self.debug("Could not get boot ID, falling back to regular soft reboot detection")
Contributor:

Can this be detected before running the command? With ShellScript('systemctl --help | grep -q "soft-reboot"') we already established we can run systemd soft-reboot, can it happen that we can run this command and not have boot ID?
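The two checks can indeed diverge, so probing both up front is worth sketching. A minimal pre-flight probe, assuming a hypothetical `run(cmd)` helper that returns `(exit_code, stdout)` for a command on the guest — this is illustrative, not tmt's actual API:

```python
def probe_soft_reboot_support(run) -> tuple[bool, bool]:
    """Return (has_soft_reboot, has_boot_id) for the guest.

    `run(cmd)` is assumed to return (exit_code, stdout).
    """
    # Does systemctl know the soft-reboot verb?
    grep_code, _ = run('systemctl --help | grep -q "soft-reboot"')
    # Is the boot ID kernel artifact present and non-empty?
    cat_code, out = run('cat /proc/sys/kernel/random/boot_id')
    return grep_code == 0, cat_code == 0 and bool(out.strip())
```

Checking the boot ID file separately covers the minimal-kernel case discussed below, where `systemctl soft-reboot` exists but `/proc/sys/kernel/random/boot_id` does not.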

Collaborator (Author):

Hmmm, it seems that a system can have the systemd capability (soft-reboot) without having the specific kernel artifact (/proc/sys/kernel/random/boot_id). It is possible to have a very minimal or specially configured Linux kernel that is new enough to work with a modern systemd, but has the generation of boot_id disabled or the file path missing due to specific kernel build options.

Contributor:

I see, good to know.

current_boot_id = boot_id_result.stdout.strip() if boot_id_result.stdout else ""

# For systemd soft reboot, boot ID should be the same
if current_boot_id == original_boot_id:
Contributor:

There seems to be a chance of a race condition: testing the boot ID before the soft reboot even begins seems possible. The soft reboot implementation prevents this by comparing the boot time, which would still be the same in such a case, but your code for systemd soft-reboot claims the same boot ID is actually expected. Is there another value that does change after a systemd soft-reboot completes?

Collaborator (Author):

Yes, it needs to be fixed ^^

Contributor:

Yeah, we do need something race-condition-proof, something that does change over systemd soft-reboot.
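One candidate that does change across a systemd soft-reboot is the service manager's SoftRebootsCount property, exposed by recent systemd releases (if the property is unavailable, this approach won't work). A minimal polling sketch, again with a hypothetical `run(cmd)` helper returning `(exit_code, stdout)` rather than tmt's actual API:

```python
import time

def wait_for_systemd_soft_reboot(run, count_before: int,
                                 attempts: int = 60,
                                 delay: float = 1.0) -> bool:
    """Poll until the manager's soft-reboot counter exceeds `count_before`.

    Unlike the boot ID, this counter increments on each soft-reboot, so
    reading it too early simply keeps us waiting instead of reporting a
    false success.
    """
    for _ in range(attempts):
        code, out = run('systemctl show --property=SoftRebootsCount --value')
        if code == 0 and out.strip().isdigit() and int(out.strip()) > count_before:
            return True
        time.sleep(delay)
    return False
```

Reading `count_before` right before issuing `systemctl soft-reboot` makes the check race-condition-proof: an unchanged counter means the reboot has not completed yet.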

Contributor:

Yes, that seems like the exact match, thank you.

Comment on lines +3166 to +3178
if mode == RebootMode.SYSTEMD_SOFT:
    new_marker = get_boot_id()
    # For systemd soft reboot, boot ID should stay the same
    if new_marker == current_marker:
        self.debug(
            "Systemd soft reboot completed successfully (boot ID unchanged)."
        )
        return
    self.debug(
        f"Boot ID changed from {current_marker} to {new_marker}, "
        "this might indicate a hard reboot occurred instead."
    )
    return  # Still accept it as the guest is back up
Contributor:

Critical logic bug in systemd soft reboot detection. The code accepts when boot_id == current_marker immediately after reconnection succeeds, but this will cause false positives. If the connection doesn't drop quickly enough after triggering the reboot command, the first successful connection attempt will find the boot ID unchanged (because reboot hasn't happened yet) and incorrectly report success.

The code needs to:

  1. First ensure the connection has dropped (guest became unavailable)
  2. Then wait for reconnection
  3. Only then verify boot ID is unchanged

For example, add a flag to track if connection was lost:

connection_dropped = False
try:
    self.execute(Command('whoami'), silent=True)
    if not connection_dropped and mode == RebootMode.SYSTEMD_SOFT:
        # Still connected, reboot hasn't happened
        raise tmt.utils.wait.WaitingIncompleteError
except tmt.utils.RunError:
    connection_dropped = True
    raise tmt.utils.wait.WaitingIncompleteError

if mode == RebootMode.SYSTEMD_SOFT and connection_dropped:
    # Now check boot ID is same
Suggested change
Before:

if mode == RebootMode.SYSTEMD_SOFT:
    new_marker = get_boot_id()
    # For systemd soft reboot, boot ID should stay the same
    if new_marker == current_marker:
        self.debug(
            "Systemd soft reboot completed successfully (boot ID unchanged)."
        )
        return
    self.debug(
        f"Boot ID changed from {current_marker} to {new_marker}, "
        "this might indicate a hard reboot occurred instead."
    )
    return  # Still accept it as the guest is back up

After:

if mode == RebootMode.SYSTEMD_SOFT:
    new_marker = get_boot_id()
    # For systemd soft reboot, boot ID should stay the same
    if connection_dropped and new_marker == current_marker:
        self.debug(
            "Systemd soft reboot completed successfully (boot ID unchanged)."
        )
        return
    if connection_dropped:
        self.debug(
            f"Boot ID changed from {current_marker} to {new_marker}, "
            "this might indicate a hard reboot occurred instead."
        )
        return  # Still accept it as the guest is back up
    # If we get here without connection_dropped, the reboot hasn't happened yet
    raise tmt.utils.wait.WaitingIncompleteError

Spotted by Graphite Agent


happz pushed a commit that referenced this pull request Nov 26, 2025
happz pushed a commit that referenced this pull request Nov 26, 2025
@github-project-automation github-project-automation bot moved this from implement to done in planning Nov 27, 2025
happz pushed a commit that referenced this pull request Nov 27, 2025
happz pushed a commit that referenced this pull request Nov 30, 2025
happz pushed a commit that referenced this pull request Nov 30, 2025
* Implements another soft reboot mode, a userspace reboot via `systemctl
  soft-reboot` command.
* Reboot-related API gets a new parameter, `mode`, replacing the `hard`
  flag, to make space for the new reboot mode.
* Fixes and improves docstrings of reboot-related API - there were many
  outdated bits and pieces, misleading poor reviewers.

Fixes: #4298
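The new `mode` parameter described above could look roughly like this; `RebootMode.SYSTEMD_SOFT` is confirmed by the review snippets earlier in the thread, while the other member names, values, and the `reboot` signature are an illustrative guess, not tmt's actual definitions:

```python
from enum import Enum

class RebootMode(Enum):
    # SYSTEMD_SOFT appears in the reviewed code; SOFT and HARD are assumed
    # counterparts for the pre-existing soft/hard reboot semantics.
    SOFT = "soft"                  # software-induced reboot, e.g. `shutdown -r now`
    HARD = "hard"                  # power-cycle via hypervisor/control plane
    SYSTEMD_SOFT = "systemd-soft"  # userspace-only `systemctl soft-reboot`

def reboot(mode: RebootMode = RebootMode.SOFT) -> None:
    """Hypothetical signature: `mode` replaces the old boolean `hard` flag."""
    ...
```

An enum keeps the API open for further reboot modes without piling up boolean flags, which is presumably why `mode` replaced `hard`.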
happz pushed a commit that referenced this pull request Dec 1, 2025
happz pushed a commit that referenced this pull request Dec 2, 2025
psss pushed a commit that referenced this pull request Dec 4, 2025
@psss psss removed this from the 1.63 milestone Dec 4, 2025

Labels

ci | full test Pull request is ready for the full test execution command | reboot Support for rebooting guests during `tmt run` and the `tmt-reboot` command

Projects

Status: done

Development

Successfully merging this pull request may close these issues.

support soft reboots

5 participants