Skip to content

Conversation

@koooosh
Copy link
Contributor

@koooosh koooosh commented Jan 16, 2026

Issue number:

Related to: bottlerocket-os/bottlerocket#4556

Description of changes:

Improves thar-be-settings service restart behavior with:

  1. Batch systemd restarts: All systemd units are now restarted with a single systemctl try-reload-or-restart call, making the restart atomic from systemd's perspective.

  2. No early bail: All service restarts are attempted regardless of individual failures so the system is not left in a partially-configured state. Failures are logged as they occur, and a summary error is returned at the end.

There are various ways to implement this, but I focused on keeping restart_services as simple as possible. This meant making symmetrical "flows" for systemd and non-systemd services so the restart process is simply:

Restart systemd -> Restart non-systemd -> Report any failures

Since systemd services are batched, restart_systemd method needs to be part of the Services impl. I removed the ServiceRestart trait since it was not used for abstraction (only one implementor) and moved the individual service restart logic to restart_non_systemd under Service impl.

Testing done:

Launched an aws-dev ami with thar-be-settings debug logs enabled and a forced failure in corndog to demonstrate the no-bail behavior. Instance failed boot as expected since settings failed to apply, but journal logs show:

  1. Systemd batch ✅
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restarting systemd units: ["containerd.service", "host-containerd.service", "metricdog.service", "cfsignal.service", "chronyd.service", "docker.service", "deprecation-warning@log4j-hotpatch-enabled.timer", "systemd-modules-load", "set-hostname.service"]
  1. All service restarts attempted despite failures ✅
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "/usr/bin/bootstrap-containers create-containers"
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "/usr/bin/corndog lockdown"
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [ERROR] Failed to restart lockdown: Restart command failed - '/usr/bin/corndog lockdown':
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "/usr/bin/certdog"
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "/usr/bin/prairiedog generate-boot-config"
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "/usr/bin/corndog sysctl"
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [ERROR] Failed to restart sysctl: Restart command failed - '/usr/bin/corndog sysctl':
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "/usr/bin/host-containers"
Jan 16 02:54:09 localhost thar-be-settings[1779]: 02:54:09 [DEBUG] (1) thar_be_settings::service: Restart command: "netdog write-resolv-conf"
Jan 16 02:54:09 localhost thar-be-settings[1779]: One or more service restarts failed

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Signed-off-by: Kush Upadhyay <kushupad@amazon.com>
@koooosh koooosh force-pushed the thar-settings-update branch from f0e2d8a to 2283cba Compare January 16, 2026 04:07
@koooosh
Copy link
Contributor Author

koooosh commented Jan 16, 2026

^ Force push fixes a minor fmt warning from builds

fn restart_non_systemd(&self) -> Result<()> {
for restart_command in &self.model.restart_commands {
// Skip systemd commands - handled by Services::restart_systemd()
if restart_command.contains("systemctl") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want contains or starts_with?

/// Execute non-systemd restart commands for this service.
fn restart_non_systemd(&self) -> Result<()> {
for restart_command in &self.model.restart_commands {
// Skip systemd commands - handled by Services::restart_systemd()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of a shame we have to sort this twice

.0
.values()
.flat_map(|s| s.model.restart_commands.iter())
.filter(|cmd| cmd.contains("systemctl"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below, why not starts_with if we're assuming the first word is the command?

Comment on lines +238 to +240
for (name, service) in &services.0 {
if let Err(e) = service.restart_non_systemd() {
error!("Failed to restart {}: {}", name, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems cleaner to partition Services into two sets, one for systemd and one for non-systemd. These could be represented by different types and each could have a restart method.

Comment on lines +62 to +65
debug!("Restarting systemd units: {:?}", &units);
let mut cmd = Command::new("/bin/systemctl");
cmd.arg("try-reload-or-restart");
cmd.args(&units);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't automatically treat all systemctl ... commands as wanting systemctl try-reload-or-restart ... instead.

We should assume the restart command must be issued exactly as specified. The only optimization permitted is to group the restart commands by subcommand - all systemctl restart ... together, all systemctl try-restart ... together, and so on.

It might also be prudent to lean into the non-determinism for restarts - for example, preparing the entire batch of restart commands, then randomizing the order. There may be some brittleness from potential ordering surprises that needs to be shaken out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants