
systemd: wait for udev to settle #762

Open
yarda wants to merge 1 commit into master from boot-udev-race-fix

Conversation

yarda
Contributor

@yarda yarda commented Mar 27, 2025

This should help with races caused by udev renaming network devices.
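A rough sketch of the kind of dependency this implies, assuming it is expressed as a drop-in for tuned.service (the actual commit may wire this up differently, e.g. directly in the shipped unit file):

mkdir -p /etc/systemd/system/tuned.service.d
cat > /etc/systemd/system/tuned.service.d/udev-settle.conf <<'EOF'
[Unit]
# Start TuneD only after the udev event queue present at boot has drained,
# so that network devices have (mostly) received their final names.
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF
systemctl daemon-reload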

@zacikpa
Contributor

zacikpa commented Mar 27, 2025

From man systemd-udev-settle.service:

Using this service is not recommended.

Waiting for systemd-udev-settle.service usually slows boot significantly, ...

This sounds a bit concerning to me.

@jmencak
Contributor

jmencak commented Mar 28, 2025

This sounds a bit concerning to me.

It does. Not sure if there's a better way to solve this. This certainly will not help on OCP, where we'd have to do a similar thing for kubelet instead, because kubelet starts the TuneD pods. On the other hand, we've verified this helped to work around the issue at least on RHEL.

@yarda
Contributor Author

yarda commented Mar 28, 2025

It depends on whether there are other boot-critical services waiting on TuneD.

For kubelet the following may work (-t 60: give up after 60 seconds):

# udevadm settle -t 60 && tuned ...

@jmencak
Contributor

jmencak commented Mar 28, 2025

It depends on whether there are other boot-critical services waiting on TuneD.

True. Booting fast and then running TuneD when (perhaps latency-critical) apps are already running might not help either; in the case of latency-critical apps, quite the opposite.

For kubelet the following may work (-t 60: give up after 60 seconds):

# udevadm settle -t 60 && tuned ...

That might be one of the options. Thinking about OpenShift now, perhaps we should only do this in our ocp-tuned-one-shot.service to start with.
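For illustration only, such a bounded wait could take the form of a drop-in for ocp-tuned-one-shot.service; the drop-in name and the 60-second limit below are assumptions, not part of this PR:

mkdir -p /etc/systemd/system/ocp-tuned-one-shot.service.d
cat > /etc/systemd/system/ocp-tuned-one-shot.service.d/udev-settle.conf <<'EOF'
[Service]
# Wait for the udev queue to drain, but give up after 60 s; the leading "-"
# lets the service start even if the settle call times out.
ExecStartPre=-/usr/bin/udevadm settle --timeout=60
EOF
systemctl daemon-reload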

@MarSik, thoughts?

@yarda yarda force-pushed the boot-udev-race-fix branch from 5e3014f to 6a29c9d on April 2, 2025 00:26
This should help with races caused by udev renaming network devices.

Signed-off-by: Jaroslav Škarvada <jskarvad@redhat.com>
@MarSik
Contributor

MarSik commented Apr 2, 2025

@jmencak The one-shot service is a prerequisite for kubelet anyway, so it makes little difference. But of course the early tuned execution should already see the proper names.

I am a bit worried about what will happen on systems with remote storage, though (= a lot of disks).

@jmencak
Contributor

jmencak commented Apr 2, 2025

@jmencak The one-shot service is a prerequisite for kubelet anyway, so it makes little difference. But of course the early tuned execution should already see the proper names.

I am a bit worried about what will happen on systems with remote storage, though (= a lot of disks).

I'd say the key is finding the "sweet spot" for how long to wait before giving up and timing out in favour of proceeding, i.e. not blocking the kubelet in OCP, or tuned itself in RHEL/other OSes (in this case), for too long. As for this PR, I'd probably like to see some reasonable timeout somewhere; I haven't investigated whether systemd-udev-settle.service provides one.

@yarda
Contributor Author

yarda commented Apr 2, 2025

There is no rocket science behind udevadm settle. All it does is wait for the udev queue to become empty. If this check is already done by the time the kubelet is started, the udev queue probably fills up later. The problem is that TuneD gets a storm of udev events during its startup, and in arbitrary order. So it can get a remove event for a device which is physically still there (but has started unplugging), is partially removed, or hasn't existed for some time. It also gets remove events for devices which are still initializing.
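For reference, the queue state can be checked by hand; udevadm settle with a zero timeout just reports whether the queue is currently empty:

udevadm settle --timeout=0 && echo "udev queue is empty" || echo "udev events still queued"
# On recent udev versions a non-empty queue is also indicated by the existence
# of /run/udev/queue.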

The cleanest approach would be to ignore udev events until TuneD is fully initialized, but this way we could miss network adapter rename events (of which it gets a lot during startup; I think that's because the TuneD process is started at the wrong time, while another process is renaming the network adapters) and add events, so some newly added devices might not get tuned.

Even a redesign wouldn't help much, because even with a single worker tuning thread, by the time it starts processing an add event the device in question could already have been added, renamed or removed several times, and that can happen even while the device is being tuned, because applying multiple tunings to a device isn't an atomic operation. This complicates the process a lot, because backend tools (e.g. ethtool) usually don't have special error codes for non-existent devices, so we would have to parse the error messages from the tools (which change between releases) to find out whether TuneD should or shouldn't report the error.
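A small illustration of that reporting problem (the device name below is deliberately bogus; the exact message and exit status differ between ethtool releases, which is exactly why parsing them is fragile):

ethtool -K no-such-dev0 gro off
echo "ethtool exit status: $?"   # non-zero, but not a dedicated "device is gone" code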

Nevertheless, we are adding patches that improve the situation, but being able to postpone the TuneD start until after most of the existing network adapters have been renamed would help a lot with possible future problems.

@jmencak
Contributor

jmencak commented Apr 2, 2025

There is no rocket science behind udevadm settle. All it does is wait for the udev queue to become empty. If this check is already done by the time the kubelet is started, the udev queue probably fills up later. The problem is that TuneD gets a storm of udev events during its startup, and in arbitrary order. So it can get a remove event for a device which is physically still there (but has started unplugging), is partially removed, or hasn't existed for some time. It also gets remove events for devices which are still initializing.

Looking at the man page of systemd-udev-settle.service, even using this service gives you no guarantee that it will wait for all events. All I'm after is finding the "sweet spot" for how long to wait, to prevent the majority of the events from being triggered while TuneD runs. By default (at least on RHEL) the timeout for this service seems to be 180 s. Is that enough? Is that too low? That's what I'm asking. Good to see that there at least is a timeout.
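The effective timeout on a given system can be checked with stock systemctl commands, e.g.:

systemctl cat systemd-udev-settle.service                        # the unit file, including any TimeoutSec=
systemctl show -p TimeoutStartUSec systemd-udev-settle.service   # the value systemd actually applies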

Nevertheless, we are adding patches that improve the situation, but being able to postpone the TuneD start until after most of the existing network adapters have been renamed would help a lot with possible future problems.

Agreed.

@yarda
Contributor Author

yarda commented Apr 2, 2025

Looking at the man page of systemd-udev-settle.service, even using this service gives you no guarantee that it will wait for all events. All I'm after is finding the "sweet spot" for how long to wait, to prevent the majority of the events from being triggered while TuneD runs. By default (at least on RHEL) the timeout for this service seems to be 180 s. Is that enough? Is that too low? That's what I'm asking. Good to see that there at least is a timeout.

On Fedora the default timeout is 120 s. It's the maximum number of seconds to wait if the queue still isn't empty; IMHO on a normal system the queue is emptied in a few seconds at most.

So let's say the queue is emptied in about 2 seconds: that means a 2-second boot delay, and after those 2 seconds udevadm settle returns, i.e. udevadm settle is equivalent to sleep 2 in such a case.

If the queue isn't emptied within 120 s (the default on Fedora if -t isn't specified), there is probably something really wrong with the system; the udevadm settle call then returns after 120 s and is equivalent to sleep 120 in such a case.
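An easy way to see which of the two cases applies on a given machine is to time the call with the proposed limit:

time udevadm settle --timeout=120   # returns as soon as the queue drains; only a stuck queue hits the limit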

IMHO @zacikpa did some measurements of the boot delay on Fedora with/without the systemd-udev-settle.service. @zacikpa, do you have any usable results?

@zacikpa
Contributor

zacikpa commented Apr 3, 2025

@yarda TBH, I only tried to measure it on machines with very few devices (say, a laptop) and there was never any delay higher than normal boot time variance.

@jmencak
Contributor

jmencak commented Apr 3, 2025

@yarda TBH, I only tried to measure it on machines with very few devices (say, a laptop) and there was never any delay higher than normal boot time variance.

The real test will be deployments with various network-attached storage devices. I suspect we'll then hit cases where deployments mostly benefit from this, but I'm sure there will be outliers where it is preferable to have partial tuning in place with a few misses. I guess we don't have a better solution right now and time will tell. What is good is that there at least seems to be a reasonable timeout.
