
systemd: wait for udev to settle #762

Open
yarda wants to merge 1 commit into master from boot-udev-race-fix

Conversation

yarda
Contributor

@yarda yarda commented Mar 27, 2025

This should help with races caused by udev renaming network devices.
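A rough sketch of the kind of dependency this implies, assuming it is expressed as a drop-in for tuned.service (the actual commit may wire this up differently, e.g. directly in the shipped unit file):

mkdir -p /etc/systemd/system/tuned.service.d
cat > /etc/systemd/system/tuned.service.d/udev-settle.conf <<'EOF'
[Unit]
# Start TuneD only after the udev event queue present at boot has drained,
# so that network devices have (mostly) received their final names.
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF
systemctl daemon-reload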

@zacikpa
Contributor

zacikpa commented Mar 27, 2025

From man systemd-udev-settle.service:

Using this service is not recommended.

Waiting for systemd-udev-settle.service usually slows boot significantly, ...

This sounds a bit concerning to me.

@jmencak
Contributor

jmencak commented Mar 28, 2025

This sounds a bit concerning to me.

It does. Not sure if there's a better way to solve this. This certainly will not help on OCP, where we'd have to do a similar thing for kubelet instead, because kubelet starts the TuneD pods. On the other hand, we've verified this helped to work around the issue at least on RHEL.

@yarda
Contributor Author

yarda commented Mar 28, 2025

It depends on whether there are other boot-critical services waiting on TuneD.

For kubelet the following may work (-t 60: give up after 60 seconds):

# udevadm settle -t 60 && tuned ...

@jmencak
Contributor

jmencak commented Mar 28, 2025

It depends on whether there are other boot-critical services waiting on TuneD.

True. Booting fast and then running TuneD when (perhaps latency-critical) apps are already running might not help either; in the case of latency-critical apps, quite the opposite.

For kubelet the following may work (-t 60: give up after 60 seconds):

# udevadm settle -t 60 && tuned ...

That might be one of the options. Thinking about OpenShift now, perhaps we should only do this in our ocp-tuned-one-shot.service to start with.
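For illustration only, such a bounded wait could take the form of a drop-in for ocp-tuned-one-shot.service; the drop-in name and the 60-second limit below are assumptions, not part of this PR:

mkdir -p /etc/systemd/system/ocp-tuned-one-shot.service.d
cat > /etc/systemd/system/ocp-tuned-one-shot.service.d/udev-settle.conf <<'EOF'
[Service]
# Wait for the udev queue to drain, but give up after 60 s; the leading "-"
# lets the service start even if the settle call times out.
ExecStartPre=-/usr/bin/udevadm settle --timeout=60
EOF
systemctl daemon-reload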

@MarSik, thoughts?

@yarda yarda force-pushed the boot-udev-race-fix branch from 5e3014f to 6a29c9d on April 2, 2025 00:26
This should help with races caused by udev renaming network devices.

Signed-off-by: Jaroslav Škarvada <jskarvad@redhat.com>
@MarSik
Contributor

MarSik commented Apr 2, 2025

@jmencak The one-shot service is a prerequisite for kubelet anyway, so it makes little difference. But of course the early tuned execution should already see the proper names.

I am a bit worried about what will happen on systems with remote storage, though (= a lot of disks).

@jmencak
Contributor

jmencak commented Apr 2, 2025

@jmencak The one-shot service is a prerequisite for kubelet anyway, so it makes little difference. But of course the early tuned execution should already see the proper names.

I am a bit worried about what will happen on systems with remote storage, though (= a lot of disks).

I'd say the key is finding the "sweet spot" for how long to wait before giving up and timing out in favour of proceeding, i.e. not blocking the kubelet in OCP, or tuned itself in RHEL/other OSes (in this case), for too long. As for this PR, I'd probably like to see some reasonable timeout somewhere; I haven't investigated whether systemd-udev-settle.service provides one.

@yarda
Contributor Author

yarda commented Apr 2, 2025

There is no rocket science behind udevadm settle. All it does is wait for the udev queue to become empty. If this check is already done by the time the kubelet is started, the udev queue probably fills up later. The problem is that TuneD gets a storm of udev events during its startup, and in arbitrary order. So it can get a remove event for a device which is physically still there (but has started unplugging), is partially removed, or hasn't existed for some time. It also gets remove events for devices which are still initializing.
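For reference, the queue state can be checked by hand; udevadm settle with a zero timeout just reports whether the queue is currently empty:

udevadm settle --timeout=0 && echo "udev queue is empty" || echo "udev events still queued"
# On recent udev versions a non-empty queue is also indicated by the existence
# of /run/udev/queue.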

The cleanest approach would be to ignore udev events until TuneD is fully initialized, but this way we could miss network adapter rename events (of which it gets a lot during startup; I think that's because the TuneD process is started at the wrong time, while another process is renaming the network adapters) and add events, so some newly added devices might not get tuned.

Even a redesign wouldn't help much, because even with a single worker tuning thread, by the time it starts processing an add event the device in question could already have been added, renamed or removed several times, and that can happen even while the device is being tuned, because applying multiple tunings to a device isn't an atomic operation. This complicates the process a lot, because backend tools (e.g. ethtool) usually don't have special error codes for non-existent devices, so we would have to parse the error messages from the tools (which change between releases) to find out whether TuneD should or shouldn't report the error.
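A small illustration of that reporting problem (the device name below is deliberately bogus; the exact message and exit status differ between ethtool releases, which is exactly why parsing them is fragile):

ethtool -K no-such-dev0 gro off
echo "ethtool exit status: $?"   # non-zero, but not a dedicated "device is gone" code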

Nevertheless, we are adding patches that improve the situation, but being able to postpone the TuneD start until after most of the existing network adapters have been renamed would help a lot with possible future problems.

@jmencak
Contributor

jmencak commented Apr 2, 2025

There is no rocket science behind udevadm settle. All it does is wait for the udev queue to become empty. If this check is already done by the time the kubelet is started, the udev queue probably fills up later. The problem is that TuneD gets a storm of udev events during its startup, and in arbitrary order. So it can get a remove event for a device which is physically still there (but has started unplugging), is partially removed, or hasn't existed for some time. It also gets remove events for devices which are still initializing.

Looking at the man page of systemd-udev-settle.service, even using this service gives you no guarantee that it will wait for all events. All I'm after is finding the "sweet spot" for how long to wait, to prevent the majority of the events from being triggered while TuneD runs. By default (at least on RHEL) the timeout for this service seems to be 180 s. Is that enough? Is that too low? That's what I'm asking. Good to see that there at least is a timeout.
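The effective timeout on a given system can be checked with stock systemctl commands, e.g.:

systemctl cat systemd-udev-settle.service                        # the unit file, including any TimeoutSec=
systemctl show -p TimeoutStartUSec systemd-udev-settle.service   # the value systemd actually applies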

Nevertheless, we are adding patches that improve the situation, but being able to postpone the TuneD start until after most of the existing network adapters have been renamed would help a lot with possible future problems.

Agreed.

@yarda
Contributor Author

yarda commented Apr 2, 2025

Looking at the man page of systemd-udev-settle.service, even using this service gives you no guarantee that it will wait for all events. All I'm after is finding the "sweet spot" for how long to wait, to prevent the majority of the events from being triggered while TuneD runs. By default (at least on RHEL) the timeout for this service seems to be 180 s. Is that enough? Is that too low? That's what I'm asking. Good to see that there at least is a timeout.

On Fedora the default timeout is 120 s. It's the maximum number of seconds to wait if the queue still isn't empty; IMHO on a normal system the queue is emptied in a few seconds at most.

So let's say the queue is emptied in about 2 seconds: that means a 2-second boot delay, and after those 2 seconds udevadm settle returns, i.e. udevadm settle is equivalent to sleep 2 in such a case.

If the queue isn't emptied within 120 s (the default on Fedora if -t isn't specified), there is probably something really wrong with the system; the udevadm settle call then returns after 120 s and is equivalent to sleep 120 in such a case.
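An easy way to see which of the two cases applies on a given machine is to time the call with the proposed limit:

time udevadm settle --timeout=120   # returns as soon as the queue drains; only a stuck queue hits the limit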

IMHO @zacikpa did some measurements of the boot delay on Fedora with/without the systemd-udev-settle.service. @zacikpa, do you have any usable results?

@zacikpa
Contributor

zacikpa commented Apr 3, 2025

@yarda TBH, I only tried to measure it on machines with very few devices (say, a laptop) and there was never any delay higher than normal boot time variance.

@jmencak
Contributor

jmencak commented Apr 3, 2025

@yarda TBH, I only tried to measure it on machines with very few devices (say, a laptop) and there was never any delay higher than normal boot time variance.

The real test will be deployments with various network-attached storage devices. I suspect we'll then hit cases where deployments mostly benefit from this, but I'm sure there will be outliers where it is preferable to have partial tuning in place with a few misses. I guess we don't have a better solution right now and time will tell. What is good is that there at least seems to be a reasonable timeout.
