Conversation
|
Some more info on why systemd soft reboot is needed is useful either in the comments or the PR description. Can't judge right now, if we should make that the default whenever systemd is present, or a feature specific to systemd or something else entirely |
7332e78 to
94d057e
Compare
|
Hi @cgwalters, I'm working on this mr, it works with virtual plugin, but not bootc plugin, and I manually ssh into the bootc system created by bootc plugin, then run "systemctl soft-reboot", then I'm not able to login to that system anymore, do you have idea why, or what additional steps should I add, or any hints,guides?Thanks:) |
|
One preliminary comment: "soft reboot" already has a history and its own meaning, which is by no means the same as "systemd soft-reboot". Please, consider using different name/label/variables (e.g. |
|
Yes true, but OTOH I think it should be quite unusual and rare for code executed inside a guest to "reach out" to a hypervisor or control plane and do a physical reboot. That's the case that I think needs dedicated nomenclature. I would probably rename the But yes arguably too systemd's soft reboots should probably have been called an |
Offhand it works for me (I happened to be test in a My recommendation here is to be sure you have a console set up at least for debugging. |
Not necessarily the code running inside a guest, but for tmt this is a real situation. A guest may freeze, crash, e.g. thanks to various kernel torturing tests, tests causing kernel oops on purpose, and tmt does have tools to invoke "hard" reboot to restore the law and order. tmt does not care that much about how it's implemented, whether it's a magic of Beaker or libvirt & qemu, but it's called "hard reboot" in tmt codebase.
That would be possible.
I think sticking to "systemd soft-reboot" term should be enough, it just needs to be consistent. My point was to double check changes to avoid variables like |
94d057e to
1843f01
Compare
sure, here is the reproducer:
with quay.io/almalinuxorg/almalinux-bootc:10.0,it indeed work, though there is a failed service after systemctl soft-reboot:
here is part of the console output: Looks like the cause is var "[FAILED] Failed to mount var.mount - /var." dmesg: https://lnie.fedorapeople.org/dmesg.txt Any idea how to avoid the failure? FYI, the system would be good after I virsh destroy and then virsh start it |
tmt/steps/provision/__init__.py
Outdated
| ) | ||
| original_boot_id = boot_id_result.stdout.strip() if boot_id_result.stdout else "" | ||
| except Exception: | ||
| self.debug("Could not get boot ID, falling back to regular soft reboot detection") |
There was a problem hiding this comment.
Can this be detected before running the command? With ShellScript('systemctl --help | grep -q "soft-reboot"') we already established we can run systemd soft-reboot, can it happen that we can run this command and not have boot ID?
There was a problem hiding this comment.
Hmmm, it seems that a system can have the systemd capability (soft-reboot) without having the specific kernel artifact (/proc/sys/kernel/random/boot_id)
It is possible to have:
A very minimal or specially configured Linux kernel that is new enough to work with a modern systemd but has the generation of boot_id disabled or the file path missing due to specific kernel build options.
tmt/steps/provision/__init__.py
Outdated
| current_boot_id = boot_id_result.stdout.strip() if boot_id_result.stdout else "" | ||
|
|
||
| # For systemd soft reboot, boot ID should be the same | ||
| if current_boot_id == original_boot_id: |
There was a problem hiding this comment.
There seems to be a chance for a race condition, testing the boot ID before the soft reboot even begins seems possible. The soft reboot implementation prevent this by comparing the boot time, which would be still the same in such a case, but your code for systemd soft-reboot claims the same boot ID is actually expected. Is there another value that does change after a systemd soft-reboot completes?
There was a problem hiding this comment.
yes, it need to be fixed^^
There was a problem hiding this comment.
Yeah, we do need something race-condition-proof, something that does change over systemd soft-reboot.
There was a problem hiding this comment.
This is a working equivalent system coreos/coreos-assembler@df29ea1 and specifically what you're looking for is in coreos/coreos-assembler@df29ea1#diff-2dc3b436d130e33087dc0a2bd0d0062a7a5ecd74410297cd847fd11ffbb31c99R220
There was a problem hiding this comment.
Yes, that seems like the exact match, thank you.
| if mode == RebootMode.SYSTEMD_SOFT: | ||
| new_marker = get_boot_id() | ||
| # For systemd soft reboot, boot ID should stay the same | ||
| if new_marker == current_marker: | ||
| self.debug( | ||
| "Systemd soft reboot completed successfully (boot ID unchanged)." | ||
| ) | ||
| return | ||
| self.debug( | ||
| f"Boot ID changed from {current_marker} to {new_marker}, " | ||
| "this might indicate a hard reboot occurred instead." | ||
| ) | ||
| return # Still accept it as the guest is back up |
There was a problem hiding this comment.
Critical logic bug in systemd soft reboot detection. The code accepts when boot_id == current_marker immediately after reconnection succeeds, but this will cause false positives. If the connection doesn't drop quickly enough after triggering the reboot command, the first successful connection attempt will find the boot ID unchanged (because reboot hasn't happened yet) and incorrectly report success.
The code needs to:
- First ensure the connection has dropped (guest became unavailable)
- Then wait for reconnection
- Only then verify boot ID is unchanged
For example, add a flag to track if connection was lost:
connection_dropped = False
try:
self.execute(Command('whoami'), silent=True)
if not connection_dropped and mode == RebootMode.SYSTEMD_SOFT:
# Still connected, reboot hasn't happened
raise tmt.utils.wait.WaitingIncompleteError
except tmt.utils.RunError:
connection_dropped = True
raise tmt.utils.wait.WaitingIncompleteError
if mode == RebootMode.SYSTEMD_SOFT and connection_dropped:
# Now check boot ID is same| if mode == RebootMode.SYSTEMD_SOFT: | |
| new_marker = get_boot_id() | |
| # For systemd soft reboot, boot ID should stay the same | |
| if new_marker == current_marker: | |
| self.debug( | |
| "Systemd soft reboot completed successfully (boot ID unchanged)." | |
| ) | |
| return | |
| self.debug( | |
| f"Boot ID changed from {current_marker} to {new_marker}, " | |
| "this might indicate a hard reboot occurred instead." | |
| ) | |
| return # Still accept it as the guest is back up | |
| if mode == RebootMode.SYSTEMD_SOFT: | |
| new_marker = get_boot_id() | |
| # For systemd soft reboot, boot ID should stay the same | |
| if connection_dropped and new_marker == current_marker: | |
| self.debug( | |
| "Systemd soft reboot completed successfully (boot ID unchanged)." | |
| ) | |
| return | |
| if connection_dropped: | |
| self.debug( | |
| f"Boot ID changed from {current_marker} to {new_marker}, " | |
| "this might indicate a hard reboot occurred instead." | |
| ) | |
| return # Still accept it as the guest is back up | |
| # If we get here without connection_dropped, the reboot hasn't happened yet | |
| raise tmt.utils.wait.WaitingIncompleteError |
Spotted by Graphite Agent
Is this helpful? React 👍 or 👎 to let us know.
* Implements another soft reboot mode, a userspace reboot via `systemctl soft-reboot` command. * Reboot-related API gets a new parameter, `mode`, replacing the `hard` flag, to make space for the new reboot mode. * Fixes and improves docstrings of reboot-related API - there were many outdated bits and pieces, misleading poor reviewers. Fixes: #4298
* Implements another soft reboot mode, a userspace reboot via `systemctl soft-reboot` command. * Reboot-related API gets a new parameter, `mode`, replacing the `hard` flag, to make space for the new reboot mode. * Fixes and improves docstrings of reboot-related API - there were many outdated bits and pieces, misleading poor reviewers. Fixes: #4298
* Implements another soft reboot mode, a userspace reboot via `systemctl soft-reboot` command. * Reboot-related API gets a new parameter, `mode`, replacing the `hard` flag, to make space for the new reboot mode. * Fixes and improves docstrings of reboot-related API - there were many outdated bits and pieces, misleading poor reviewers. Fixes: #4298
* Implements another soft reboot mode, a userspace reboot via `systemctl soft-reboot` command. * Reboot-related API gets a new parameter, `mode`, replacing the `hard` flag, to make space for the new reboot mode. * Fixes and improves docstrings of reboot-related API - there were many outdated bits and pieces, misleading poor reviewers. Fixes: #4298
Fixes: #4298
I guess we should support systemctl soft-reboot for all possible plugins?
Because it could help our users to update all of userspace (when there's no kernel changes) efficiently, which means shorter reboot time.
And, we should not make it default.
"It definitely can't work to change all reboots to soft reboots as some use cases will want to change the kernel for example."
Pull Request Checklist