Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Avoid races on service import, properties on instances, explicit refresh #1124

Merged
merged 2 commits into from
May 27, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 56 additions & 11 deletions sled-agent/src/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,9 @@
use crate::common::instance::{
Action as InstanceAction, InstanceStates, PROPOLIS_PORT,
};
use crate::illumos::running_zone::{InstalledZone, RunningZone};
use crate::illumos::running_zone::{
InstalledZone, RunCommandError, RunningZone,
};
use crate::illumos::svc::wait_for_service;
use crate::illumos::vnic::VnicAllocator;
use crate::illumos::zone::{AddressRequest, PROPOLIS_ZONE_PREFIX};
Expand Down Expand Up @@ -516,37 +518,80 @@ impl Instance {
info!(inner.log, "Created address {} for zone: {}", network, zname);

// Run Propolis in the Zone.
let smf_service_name = "svc:/system/illumos/propolis-server";
let instance_name = format!("vm-{}", inner.propolis_id());
let smf_instance_name =
format!("{}:{}", smf_service_name, instance_name);
let server_addr = SocketAddr::new(inner.propolis_ip, PROPOLIS_PORT);

// We intentionally do not import the service - it is placed under
// `/var/svc/manifest`, and should automatically be imported by
// configd.
//
// Insteady, we re-try adding the instance until it succeeds.
// This implies that the service was added successfully.
info!(
inner.log,
"Adding {} as a {} service", &instance_name, smf_service_name
);
backoff::retry_notify(
backoff::internal_service_policy(),
|| async {
running_zone
.run_cmd(&[
crate::illumos::zone::SVCCFG,
"-s",
smf_service_name,
"add",
&instance_name,
])
.map_err(|e| backoff::BackoffError::transient(e))
},
|err: RunCommandError, delay| {
warn!(
inner.log,
"Failed to add {} as a service (retrying in {:?}): {}",
instance_name,
delay,
err.to_string()
);
},
)
.await?;

info!(inner.log, "Adding service property group");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the service name to these log messages? The actual address might also be nice, in the svccfg setprop call below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more context. These all have the instance ID as part of the context already, so the log messages look like:

[2022-05-27T00:08:00.58389195Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Adding service property group (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.617921961Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Setting server address property (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.649773537Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Refreshing instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.675333382Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Enabling instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)   

This should make it easy to disambiguate if multiple instances are being allocated simultaneously

running_zone.run_cmd(&[
crate::illumos::zone::SVCCFG,
"import",
"/var/svc/manifest/site/propolis-server/manifest.xml",
"-s",
&smf_instance_name,
"addpg",
"config",
"astring",
])?;

info!(inner.log, "Setting server address property");
running_zone.run_cmd(&[
crate::illumos::zone::SVCCFG,
"-s",
"system/illumos/propolis-server",
&smf_instance_name,
"setprop",
&format!("config/server_addr={}", server_addr),
])?;

info!(inner.log, "Refreshing instance");
running_zone.run_cmd(&[
crate::illumos::zone::SVCCFG,
"-s",
"svc:/system/illumos/propolis-server",
"add",
&format!("vm-{}", inner.propolis_id()),
&smf_instance_name,
"refresh",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had talked about putting an additional check here, that the config/server_addr property has actually taken for the instance. Do you think that's overly paranoid?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm okay omitting it, based on the data collected (500+ iterations without the server_addr missing), but I'd be happy to add it later if we see any issues?

])?;

info!(inner.log, "Enabling instance");
running_zone.run_cmd(&[
crate::illumos::zone::SVCADM,
"enable",
"-t",
&format!(
"svc:/system/illumos/propolis-server:vm-{}",
inner.propolis_id()
),
&smf_instance_name,
])?;

info!(inner.log, "Started propolis in zone: {}", zname);
Expand Down