Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Avoid races on service import, properties on instances, explicit refresh #1124

Merged
merged 2 commits into from
May 27, 2022

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented May 26, 2022

Fixes #1115 , hopefully.

Previously

I used the following script:

set -x
oxide org     create myorg -D "description"
oxide project create --organization myorg myproject -D "description"
set +x

set -x
for i in {0..10000}; do
  INSTANCE_NAME="myinstance${i}"
  oxide instance create --organization myorg --project myproject ${INSTANCE_NAME} -D "description" -c 2 -m 1073741824 --hostname myinstance
  oxide instance stop   --confirm --organization myorg --project myproject ${INSTANCE_NAME}
  oxide instance delete --confirm --organization myorg --project myproject ${INSTANCE_NAME}
done
set +x

After < 10 iterations, I'd typically see an SMF failure, either due to a missing property, or due to some issue on import.

After discussion with @bnaecker and @rmustacc , we confirmed that the Propolis Server zone automatically imports the manifest, and that it may be racing with an explicit import.

This PR

This PR performs the following actions:

  1. Rather than importing the manifest, we simply try to add a service instance in a retry loop. As seen in the log files, this typically fails for the first couple invocations (with the output svccfg: Pattern 'svc:/system/illumos/propolis- server' doesn't match any instances or services), but succeeds once the built-in configd service has finished importing the propolis manifest.
  2. It sets configuration parameters on the service instance, rather than the service itself.
  3. It explicitly refreshes the instance before enabling it, to make sure the properties are up-to-date.

The aforementioned "create an instance, stop it, destroy it" script has been running on my machine for nearly 100 iterations without a single failure using this PR, which is a major improvement.

@smklein smklein requested a review from bnaecker May 26, 2022 22:40
"add",
&format!("vm-{}", inner.propolis_id()),
&smf_instance_name,
"refresh",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had talked about putting an additional check here, that the config/server_addr property has actually taken for the instance. Do you think that's overly paranoid?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm okay omitting it, based on the data collected (500+ iterations without the server_addr missing), but I'd be happy to add it later if we see any issues?

)
.await?;

info!(inner.log, "Adding service property group");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the service name to these log messages? The actual address might also be nice, in the svccfg setprop call below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more context. These all have the instance ID as part of the context already, so the log messages look like:

[2022-05-27T00:08:00.58389195Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Adding service property group (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.617921961Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Setting server address property (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.649773537Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Refreshing instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.675333382Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Enabling instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)   

This should make it easy to disambiguate if multiple instances are being allocated simultaneously

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on! Looks good, and sounds like you've not hit any similar races after a while through the instance creation loop. I have a couple of comments, but otherwise 👍 .

@bnaecker
Copy link
Collaborator

For reference, this should close #1115.

@smklein smklein enabled auto-merge (squash) May 27, 2022 14:28
@smklein smklein merged commit c752682 into main May 27, 2022
@smklein smklein deleted the fix-smf branch May 27, 2022 15:22
leftwo pushed a commit that referenced this pull request Feb 9, 2024
Crucible changes:
Remove unused fields in IOop (#1149)
New downstairs clone subcommand. (#1129)
Simplify the do_work_task loop (#1150)
Move `Guest` stuff into a module (#1125)
Bump nix to 0.27.1 and use new safer Fd APIs (#1110)
Move `FramedWrite` work to a separate task (#1145)
Use fewer borrows in ExtentInner API (#1147)
Update Rust crate reedline to 0.28.0 (#1141)
Update Rust crate tokio to 1.36 (#1143)
Update Rust crate slog-bunyan to 2.5.0 (#1139)
Update Rust crate rayon to 1.8.1 (#1138)
Update Rust crate itertools to 0.12.1 (#1137)
Update Rust crate byte-unit to 5.1.4 (#1136)
Update Rust crate base64 to 0.21.7 (#1135)
Update Rust crate async-trait to 0.1.77 (#1134)
Discard deferred msgs (#1131)
Minor Downstairs cleanup (#1127)
Update test_fail_live_repair to support pstop (#1128)
Ignore client messages after stopping the IO task (#1126)
Move client IO task into a struct (#1124)
Bump Rust to 1.75 and fix new Clippy lints (#1123)

Propolis changes:
PHD: convert to async (#633)
PHD: assume specialized Windows images (#636)
propolis-standalone-config needn't be a crate
standalone: Use tar for snapshot/restore
phd: use latest "lab-2.0-opte" target, not a specific version (#637)
PHD: add tests for migration of running processes (#623)
PHD: fix `cargo xtask phd` tidy not doing anything (#630)
PHD: add documentation for `cargo xtask phd` (#629)
standalone: improve virtual device creation errors (#632)
phd: add Windows Server 2019 guest adapter (#627)
PHD: add `cargo xtask phd` to make using PHD nicer (#619)
leftwo added a commit that referenced this pull request Feb 9, 2024
Crucible changes:
Remove unused fields in IOop (#1149)
New downstairs clone subcommand. (#1129)
Simplify the do_work_task loop (#1150)
Move `Guest` stuff into a module (#1125)
Bump nix to 0.27.1 and use new safer Fd APIs (#1110) Move `FramedWrite`
work to a separate task (#1145) Use fewer borrows in ExtentInner API
(#1147)
Update Rust crate reedline to 0.28.0 (#1141)
Update Rust crate tokio to 1.36 (#1143)
Update Rust crate slog-bunyan to 2.5.0 (#1139)
Update Rust crate rayon to 1.8.1 (#1138)
Update Rust crate itertools to 0.12.1 (#1137)
Update Rust crate byte-unit to 5.1.4 (#1136)
Update Rust crate base64 to 0.21.7 (#1135)
Update Rust crate async-trait to 0.1.77 (#1134)
Discard deferred msgs (#1131)
Minor Downstairs cleanup (#1127)
Update test_fail_live_repair to support pstop (#1128) Ignore client
messages after stopping the IO task (#1126) Move client IO task into a
struct (#1124)
Bump Rust to 1.75 and fix new Clippy lints (#1123)

Propolis changes:
PHD: convert to async (#633)
PHD: assume specialized Windows images (#636)
propolis-standalone-config needn't be a crate
standalone: Use tar for snapshot/restore
phd: use latest "lab-2.0-opte" target, not a specific version (#637)
PHD: add tests for migration of running processes (#623) PHD: fix `cargo
xtask phd` tidy not doing anything (#630) PHD: add documentation for
`cargo xtask phd` (#629) standalone: improve virtual device creation
errors (#632) phd: add Windows Server 2019 guest adapter (#627)
PHD: add `cargo xtask phd` to make using PHD nicer (#619)

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[sled-agent] Propolis server SMF service listen address is not always set correctly
2 participants