grub-boot-success.timer interfering with greenboot #108

Open
say-paul opened this issue Jul 14, 2023 · 41 comments

Comments

@say-paul
Member

An SSH session seems to trigger grub-boot-success.timer from grub2-tools, which sets the boot success flag. As a result, grub does not decrement the boot counter if there is a reboot.

@nullr0ute
Member

Going to need more concrete evidence than a brief statement.

@pmtk

pmtk commented Jul 14, 2023

I encountered this issue recently and can provide some details.

Package grub2-tools provides two files involved in this problem:

/usr/lib/systemd/user/grub-boot-success.service
/usr/lib/systemd/user/grub-boot-success.timer

After a user logs in (e.g. over SSH), the timer starts, and after 2 minutes the service runs (ExecStart=/usr/sbin/grub2-set-bootflag boot_success).

The same package provides /etc/grub.d/08_fallback_counting in which we can see:

# Check if boot_counter exists and boot_success=0 to activate this behaviour.
if [ -n "\${boot_counter}" -a "\${boot_success}" = "0" ]; then

This means that if a user logs in during the health checks, and the health checks take longer than the 2 minutes needed to trigger grub-boot-success.service, the service sets boot_success before the failed health check reboots the system. On the next boot grub then won't decrement boot_counter, resulting in more "reboots to heal" than expected (e.g. than configured in greenboot).
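
The interaction can be modeled with a small shell sketch. It is a simplified illustration, not the grub script itself: the function name `next_boot_counter` is hypothetical, the condition mirrors the line quoted from 08_fallback_counting, and the decrement-by-one step is an assumption about what grub does when that condition holds.

```shell
#!/bin/sh
# Simplified model of the check quoted from /etc/grub.d/08_fallback_counting.
# next_boot_counter is a hypothetical helper, not part of grub2-tools.
# Usage: next_boot_counter <boot_counter> <boot_success>
next_boot_counter() {
    counter=$1
    success=$2
    # Same condition as the quoted grub snippet: the counter exists and the
    # previous boot was not marked successful.
    if [ -n "$counter" ] && [ "$success" = "0" ]; then
        # Assumption: grub decrements the counter by one while it is above zero.
        if [ "$counter" -gt 0 ]; then
            counter=$((counter - 1))
        fi
    fi
    printf '%s\n' "$counter"
}

# Health checks failed and nobody was logged in: boot_success stayed 0.
next_boot_counter 2 0
# A user session outlived the 2-minute timer: boot_success was set to 1.
next_boot_counter 2 1
```

The first call prints 1 (the counter moves toward rollback), while the second prints 2 (the counter is stuck), which matches the repeated boot_counter values in a log where a session was open during the health checks.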

This is a log I obtained when I discovered the issue:

$ sudo journalctl -u greenboot-healthcheck | grep -E "\-\- Boot|boot_[s|c]"
-- Boot c3172dea78584d9b9879f818be8c1f34 --
Jul 05 12:36:41 ostree-dev 40_microshift_running_check.sh[1189]: boot_success=0
Jul 05 12:36:41 ostree-dev 40_microshift_running_check.sh[1189]: boot_counter=2

-- Boot 4081b57ed6284b07a48c105c5be9a79e --
Jul 05 12:41:12 ostree-dev 40_microshift_running_check.sh[1233]: boot_success=0
Jul 05 12:41:12 ostree-dev 40_microshift_running_check.sh[1233]: boot_counter=2

-- Boot e7fc9bb7e24f4ea080c7c4d486533567 --
Jul 05 12:44:45 ostree-dev 40_microshift_running_check.sh[1281]: boot_success=0
Jul 05 12:44:45 ostree-dev 40_microshift_running_check.sh[1281]: boot_counter=2

-- Boot 9e77bc317ad54761845d53118ef52c9b --
Jul 05 12:48:21 ostree-dev 40_microshift_running_check.sh[1222]: boot_success=0
Jul 05 12:48:21 ostree-dev 40_microshift_running_check.sh[1222]: boot_counter=2

-- Boot 9db0cd6a1bd74bb6a2ef3bfb19bfc2e5 --
Jul 05 12:51:58 ostree-dev 40_microshift_running_check.sh[1190]: boot_success=0
Jul 05 12:51:58 ostree-dev 40_microshift_running_check.sh[1190]: boot_counter=1

-- Boot 05f39c32ef48400ea07cb000af463ecb --
Jul 05 12:55:16 ostree-dev 40_microshift_running_check.sh[1283]: boot_success=0
Jul 05 12:55:16 ostree-dev 40_microshift_running_check.sh[1283]: boot_counter=1

-- Boot bbac60cff9a84ef19ed99a949dd38d97 --
Jul 05 12:58:50 ostree-dev 40_microshift_running_check.sh[1189]: boot_success=0
Jul 05 12:58:50 ostree-dev 40_microshift_running_check.sh[1189]: boot_counter=1

-- Boot b2c67e9a44764388aa4c9fd120141a2c --
Jul 05 13:02:30 ostree-dev 40_microshift_running_check.sh[1187]: boot_success=0
Jul 05 13:02:30 ostree-dev 40_microshift_running_check.sh[1187]: boot_counter=0

-- Boot d2e24133fd294f5580f40cd5537512ed --

@LorbusChris
Member

To me it looks like grub-boot-success.{service,timer} should be excluded from Fedora IoT and other OSes that use greenboot.

@LorbusChris
Member

Further, if grub-boot-success.service and grub2-set-success.service were more aligned (i.e. grub-boot-success.service would also ensure the boot_counter is unset) Fedora IoT could also just drop the timer, drop greenboot-grub2-set-success.service and run with the grub-boot-success.service.

@say-paul
Member Author

say-paul commented Jul 21, 2023

We can use this as a workaround:
systemctl --user mask grub-boot-success.timer

@say-paul
Member Author

@LorbusChris Is there a way to drop just these two, grub-boot-success.{service,timer},
or do we need to remove the entire rpm, grub2-tools? If it's the latter, then I think we can go with masking them unless we understand the bigger implications of removing the entire rpm.

@LorbusChris
Member

IMO this should be solved by using a preset that ensures grub-boot-success.{service,timer} is disabled on Fedora IoT/RHEL Edge. They should really only be enabled on Desktop variants, not on Server/IoT.

@LorbusChris
Member

Less clean and less preferable, but also working as you mentioned, would be to mask them, which you might be able to do in osbuild/Image Builder.

@pmtk

pmtk commented Jul 23, 2023

We use masking when testing MicroShift.

I'm not sure, but it looks like osbuild might not support masking, only enabling / disabling (which does not work for timers): https://www.osbuild.org/guides/image-builder-on-premises/blueprint-reference.html#systemd-services

Unless you have another idea for how to handle it? We were also thinking about putting systemctl [--user] mask in the kickstart.

@LorbusChris
Member

LorbusChris commented Jul 24, 2023

On Fedora IoT, this timer is already disabled by this preset: https://src.fedoraproject.org/rpms/fedora-release/blob/rawhide/f/80-iot-user.preset#_6

This is included in the fedora-release-iot rpm.

Doing this via a preset config is the proper way to do it.

@pmtk

pmtk commented Jul 24, 2023

But this method is for Fedora IoT only, right?
For RHEL for Edge, the documentation recommends osbuild, and I'm not aware of such an option in the blueprint. Or is there a way for R4E?

@say-paul
Member Author

Evaluating ways to do this in osbuild-composer as well; it might require some modifications to the source code.

@say-paul
Member Author

The fix PR can be tracked here: osbuild/images#51

@miabbott
Member

> Doing this via a preset config is the proper way to do it.

While I think this is the correct way to do it, this is harder to do for RHEL for Edge because we don't differentiate between RHEL Server and RHEL for Edge.

We got into what I think is a similar discussion when we looked into changing /etc/os-release to indicate that the system was running RHEL for Edge and how it could be done. The takeaway is that we don't want to ship something like a redhat-release-edge RPM that includes RHEL for Edge specific config because it weakens the story that RHEL for Edge is just RHEL and it opens the door for someone on RHEL Server to install that package and cause problems.

It would be great if we could define systemd presets for RHEL for Edge in osbuild itself; this would be akin to what is done in Red Hat CoreOS when that is built.

@miabbott
Member

> It would be great if we could define systemd presets for RHEL for Edge in osbuild itself; this would be akin to what is done in Red Hat CoreOS when that is built.

So I guess there is a stage for this already? osbuild/osbuild#1269

@say-paul
Member Author

say-paul commented Jul 27, 2023

I have been testing presets placed in both
/usr/lib/systemd/user-preset/
and
/usr/lib/systemd/systemd-preset/
(landed using osbuild-composer), but neither seems to disable it.
Fedora disables it using https://src.fedoraproject.org/rpms/fedora-release/blob/rawhide/f/80-iot-user.preset
Not sure why it's not working in RHEL.

[admin@localhost ~]$ cat /usr/lib/systemd/user-preset/50-osbuild.preset 
disable grub-boot-success.timer
[admin@localhost ~]$ systemctl --user status grub-boot-success.timer
● grub-boot-success.timer - Mark boot as successful after the user session has run 2 minutes
     Loaded: loaded (/usr/lib/systemd/user/grub-boot-success.timer; static)
     Active: active (elapsed) since Thu 2023-07-27 10:24:41 EDT; 21min ago
      Until: Thu 2023-07-27 10:24:41 EDT; 21min ago
    Trigger: n/a
   Triggers: ● grub-boot-success.service

Jul 27 10:24:41 localhost.localdomain systemd[1242]: Started Mark boot as successful after the user session has run 2 minutes.

@dhellmann

@say-paul I'm not an expert on this, but from what I've found it looks like "mask" is different from "disable". I think the main difference is that something that is disabled will still be run if it is a dependency of another service, but if it is masked then it will not be run ever. Is it possible that something in RHEL has declared a dependency on grub-boot-success.timer?

@say-paul
Member Author

I have looked into that; fedora-release does the same here: https://src.fedoraproject.org/rpms/fedora-release/blob/rawhide/f/fedora-release.spec#_1363

@runcom @LorbusChris @nullr0ute any suggestions on why presets are not working?

@runcom
Member

runcom commented Jul 28, 2023

> @runcom @LorbusChris @nullr0ute any suggestions on why presets are not working?

They're likely not working because the system already has a machine-id when it boots, so presets aren't run; this is how osbuild works. The workaround for edge+ignition was to do this: https://github.com/osbuild/osbuild-composer/blob/8ff4c0c40af0ee1f25da5336733a5876e5d2b82a/test/data/manifests/rhel_92-x86_64-edge_ami-boot.json#L2396-L2397 which basically adds two more kernel arguments to the system on first boot to mimic firstboot with machine-id. For systems not using Ignition (like the FDO case maybe, or normally) we have no way to define "firstboot only kargs to mimic firstboot", so something else has to be thought of. Sayan, you can test that presets work by building a RHEL for Edge artifact that uses Ignition.

@say-paul
Member Author

say-paul commented Aug 4, 2023

Tried with Ignition too, but the timer is still active.

What worked is a drop-in for grub-boot-success.timer:

[core@localhost ~]$ systemctl --user status grub-boot-success.timer
○ grub-boot-success.timer - Mark boot as successful after the user session has run 2 minutes
     Loaded: loaded (/usr/lib/systemd/user/grub-boot-success.timer; static)
    Drop-In: /usr/lib/systemd/user/grub-boot-success.timer.d
             └─10-disable.conf
     Active: inactive (dead)
    Trigger: n/a
   Triggers: ● grub-boot-success.service
  Condition: start condition failed at Fri 2023-08-04 04:48:41 EDT; 21s ago
             └─ ConditionPathExists=!/usr/libexec/greenboot was not met

Aug 04 04:48:41 localhost.localdomain systemd[1111]: Mark boot as successful after the user session has run 2 minutes was skipped because of an unmet condition check (ConditionPathExists=!/usr/li>
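
Judging from the status output above, the drop-in presumably contains a single condition in its `[Unit]` section. The sketch below writes such a file into a scratch directory rather than the real /usr tree. The `root` variable and the use of mktemp are assumptions for illustration, and the file actually shipped by the fix may differ.

```shell
#!/bin/sh
# Sketch: recreate the drop-in seen in the status output inside a scratch root.
# "root" stands in for the image tree; it is an assumption for illustration.
root=$(mktemp -d)
dropin_dir="$root/usr/lib/systemd/user/grub-boot-success.timer.d"
mkdir -p "$dropin_dir"

# The condition is copied from the status output: the timer is skipped
# whenever /usr/libexec/greenboot exists.
cat > "$dropin_dir/10-disable.conf" <<'EOF'
[Unit]
ConditionPathExists=!/usr/libexec/greenboot
EOF

cat "$dropin_dir/10-disable.conf"
```

Because the condition is evaluated each time the timer would start, the timer stays installed but is skipped on any system where greenboot is present.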

@LorbusChris
Member

I think we have to be careful how we word things here.

Presets work as expected in Fedora IoT. They are shipped in the fedora-release-iot rpm.
There is no such variant-specific rpm containing presets for RHEL Edge.

When the iot presets are present in an rpm consumed by rpm-ostree compose or equivalent used by ImageBuilder, things should just work(TM).

To me it seems the imperative post-processing for systemd unit activation that ImageBuilder does contradicts the declarative config approach taken elsewhere, e.g. by systemd and ignition.

In CoreOS / ostree-native images, the initial presets are applied during rpm-ostree compose.
Layering additional presets on top of a base image is done by re-running systemctl preset-all during the layered container image build, see: https://github.com/openshift/okd-machine-os/blob/de94e64a1ce6436a0c8c0e72982d41a1163a658d/Dockerfile#L35

Have you thought about shipping the iot presets in the RHEL build of the greenboot rpm to avoid configuring this through ImageBuilder's mechanism or Ignition?

@say-paul
Member Author

say-paul commented Aug 7, 2023

@LorbusChris I tried running systemctl preset-all manually, with no effect on the timer.
I can see that grub-boot-success.timer is statically linked.
So my understanding is that it's not possible for systemd to enable/disable it.

@pmtk

pmtk commented Aug 7, 2023

Just FYI, we made a change in MicroShift's kickstart to manually mask the timer (src):

ln -sf /dev/null /etc/systemd/user/grub-boot-success.timer
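
For reference, masking works by pointing the unit name at /dev/null in a directory that takes precedence over /usr/lib/systemd/user. A minimal sketch that exercises the same `ln -sf` against a scratch directory follows; the `root` variable is an assumption so that the example does not touch a real system.

```shell
#!/bin/sh
# Sketch: mask the user timer the way the kickstart line does, but inside a
# scratch root so nothing on the running system changes.
root=$(mktemp -d)
mkdir -p "$root/etc/systemd/user"

# A unit name that resolves to /dev/null is treated by systemd as masked.
ln -sf /dev/null "$root/etc/systemd/user/grub-boot-success.timer"

readlink "$root/etc/systemd/user/grub-boot-success.timer"
```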

Not sure if I missed something in the discussion, but you cannot disable/enable timers, you can only mask them. Could that be the reason for problems with presets?

@LorbusChris
Member

LorbusChris commented Aug 7, 2023

> @LorbusChris I tried running systemctl preset-all manually, with no effect on the timer. I can see that grub-boot-success.timer is statically linked. So my understanding is that it's not possible for systemd to enable/disable it.

That's wrong. Creation/deletion of that symbolic link is exactly how systemd enables/disables things.
This should work:

  • create the preset config disabling grub-boot-success.timer
  • run systemctl preset-all --user
  • -> Removed "/var/home/lorbus/.config/systemd/user/timers.target.wants/grub-boot-success.timer".

@LorbusChris
Member

Again, have you considered shipping the preset as part of the RPM? With the preset present at compose time, things should just work, like they do in Fedora IoT.

@say-paul
Member Author

say-paul commented Aug 7, 2023

The preset was not part of the rpm; it was added by a stage at commit time.
Also, when manually trying to disable it, I get:

[core@localhost ~]$ systemctl --user enable grub-boot-success.timer
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled or disabled using systemctl.
 
Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.

My understanding is that systemd presets act similarly to systemctl enable/disable.

@LorbusChris changed the title from "ssh-sessions interfering with greenboot" to "grub-boot-success.timer not disabled on RHEL" on Aug 7, 2023

@LorbusChris
Member

In order to add the preset in the RPM for RHEL builds only, you can do something like this in the specfile:

%if 0%{?rhel}
  # create/copy preset config
%endif

@say-paul
Member Author

say-paul commented Aug 8, 2023

@LorbusChris I'd first like to test whether the timer can be disabled manually, and then move on to automating the whole process:

> This should work:
>
>   • create the preset config disabling grub-boot-success.timer
>   • run systemctl preset-all --user
>   • -> Removed "/var/home/lorbus/.config/systemd/user/timers.target.wants/grub-boot-success.timer".

Is this working for you? For me it's not; I am testing this on RHEL 9.2.
Even @pmtk and @dhellmann are facing a similar issue when trying to disable the timer manually.

@achilleas-k

achilleas-k commented Aug 8, 2023

> That's wrong. Creation/deletion of that symbolic link is exactly how systemd enables/disables things.

But as far as I understand, for systemd to be able to enable/disable something, the unit or timer needs to define an [Install] section. In the case of grub-boot-success.timer, there is no [Install] section. The timer is enabled through a symlink in the rpm. Any attempt to disable it (directly or through presets) fails because there's no indication in the unit itself of which target the timer is linked to.

Am I misunderstanding something?
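
One way to check this for any unit is to look for an `[Install]` section in the unit file. A rough sketch follows. `has_install_section` is a hypothetical helper, and the unit file written here is a stand-in reconstructed from the description in the status output, not a copy of the file shipped by grub2-tools (the `OnActiveSec` value in particular is an assumption).

```shell
#!/bin/sh
# has_install_section is a hypothetical helper: it succeeds when the given
# unit file contains an [Install] section header.
has_install_section() {
    grep -q '^\[Install\]' "$1"
}

# Stand-in unit file for illustration only; the real timer may differ.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
Description=Mark boot as successful after the user session has run 2 minutes

[Timer]
OnActiveSec=2min
EOF

if has_install_section "$unit"; then
    echo "has [Install]: systemctl enable/disable and presets can act on it"
else
    echo "no [Install]: enable/disable and presets have nothing to act on"
fi
```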

@LorbusChris
Member

> Not sure if I missed something in the discussion, but you cannot disable/enable timers, you can only mask them. Could that be the reason for problems with presets?

You usually can, but not with this specific timer, because it doesn't have an [Install] section in the unit file.
Looking at it more closely, I see this timer is statically enabled (not statically linked) by a symlink.

The grub rpm creates that symlink for static enablement unconditionally instead of using the usual %systemd_user_post macro (which might not work in grub's case because of the missing [Install] section).

If you can verify that this issue DOES NOT occur on Fedora IoT, then I think you should try to ship the preset in the rpm and see where that leads.

If this also occurs on FIoT in spite of the presets, the proper way to fix this IMO would be to split grub-boot-success.timer and the /usr/lib/systemd/user/timers.target.wants/grub-boot-success.timer symlink out into a grub subpackage which you'd exclude in the rpm-ostree compose.

Symlinking /dev/null will work but seems hacky to me. Alternatively, you could also just delete /usr/lib/systemd/user/timers.target.wants/grub-boot-success.timer symlink which statically enables the timer in osbuild.
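
The symlink removal mentioned above could look roughly like this at image build time. The sketch builds a scratch tree with the same layout and removes the link there. `root` and the placeholder unit file are assumptions; on a real image this would have to happen in the build pipeline, since /usr is read-only on a deployed ostree system.

```shell
#!/bin/sh
# Sketch: drop the static enablement symlink inside a scratch root.
root=$(mktemp -d)
user_dir="$root/usr/lib/systemd/user"
mkdir -p "$user_dir/timers.target.wants"

# Placeholder unit plus the symlink that statically enables it.
: > "$user_dir/grub-boot-success.timer"
ln -s ../grub-boot-success.timer "$user_dir/timers.target.wants/grub-boot-success.timer"

# Removing only the link leaves the unit installed but no longer pulled in
# by timers.target.
rm -f "$user_dir/timers.target.wants/grub-boot-success.timer"

ls "$user_dir/timers.target.wants"
```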

@LorbusChris
Member

@achilleas-k that's correct, I didn't look closely at the timer unit before. See my previous comment.

@achilleas-k

> the proper way to fix this IMO would be to split grub-boot-success.timer and the /usr/lib/systemd/user/timers.target.wants/grub-boot-success.timer symlink out into a grub subpackage which you'd exclude in the rpm-ostree compose.

We all agree on that, but we need a workaround for now, which is why we ended up talking about masking or adding a drop-in condition for the grub-boot-success.timer to not start if greenboot is running.

@LorbusChris
Member

Does this occur on Fedora IoT or not?

@achilleas-k

It does. The issue is present in both IoT images built by image builder and I also just checked the official raw image from https://fedoraproject.org/iot/download/

@LorbusChris
Member

Thanks, that means the preset does not work as expected, not even on Fedora IoT where it exists.

@LorbusChris changed the title from "grub-boot-success.timer not disabled on RHEL" to "grub-boot-success.timer interfering with greenboot" on Aug 8, 2023

@say-paul
Member Author

say-paul commented Aug 8, 2023

@pmtk Yes, masking seems to be an option, but the bigger goal is to set it up from the image itself. The solution that worked for me is adding a drop-in.

@miabbott
Member

miabbott commented Aug 9, 2023

@LorbusChris Sayan had previously filed - https://bugzilla.redhat.com/show_bug.cgi?id=2229703

Shall we close that as a dupe of yours, since you have baggage attached to your BZ already?

@LorbusChris
Member

Wfm

@say-paul
Member Author

osbuild/images#51 disables the timer via a drop-in.

@LorbusChris
Member

Let's reopen this until the issue is properly fixed on Fedora IoT, too.

@LorbusChris reopened this on Aug 28, 2023