-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sled-agent] Avoid races on service import, properties on instances, explicit refresh #1124
Conversation
… explicit refresh
"add", | ||
&format!("vm-{}", inner.propolis_id()), | ||
&smf_instance_name, | ||
"refresh", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had talked about putting an additional check here, that the config/server_addr
property has actually taken for the instance. Do you think that's overly paranoid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm okay omitting it, based on the data collected (500+ iterations without the server_addr missing), but I'd be happy to add it later if we see any issues?
sled-agent/src/instance.rs
Outdated
) | ||
.await?; | ||
|
||
info!(inner.log, "Adding service property group"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the service name to these log messages? The actual address might also be nice, in the svccfg setprop
call below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add more context. These all have the instance ID as part of the context already, so the log messages look like:
[2022-05-27T00:08:00.58389195Z] INFO: SledAgent/InstanceManager/6692 on thelio: Adding service property group (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.617921961Z] INFO: SledAgent/InstanceManager/6692 on thelio: Setting server address property (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.649773537Z] INFO: SledAgent/InstanceManager/6692 on thelio: Refreshing instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.675333382Z] INFO: SledAgent/InstanceManager/6692 on thelio: Enabling instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
This should make it easy to disambiguate if multiple instances are being allocated simultaneously
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on! Looks good, and sounds like you've not hit any similar races after a while through the instance creation loop. I have a couple of comments, but otherwise 👍 .
For reference, this should close #1115. |
Crucible changes: Remove unused fields in IOop (#1149) New downstairs clone subcommand. (#1129) Simplify the do_work_task loop (#1150) Move `Guest` stuff into a module (#1125) Bump nix to 0.27.1 and use new safer Fd APIs (#1110) Move `FramedWrite` work to a separate task (#1145) Use fewer borrows in ExtentInner API (#1147) Update Rust crate reedline to 0.28.0 (#1141) Update Rust crate tokio to 1.36 (#1143) Update Rust crate slog-bunyan to 2.5.0 (#1139) Update Rust crate rayon to 1.8.1 (#1138) Update Rust crate itertools to 0.12.1 (#1137) Update Rust crate byte-unit to 5.1.4 (#1136) Update Rust crate base64 to 0.21.7 (#1135) Update Rust crate async-trait to 0.1.77 (#1134) Discard deferred msgs (#1131) Minor Downstairs cleanup (#1127) Update test_fail_live_repair to support pstop (#1128) Ignore client messages after stopping the IO task (#1126) Move client IO task into a struct (#1124) Bump Rust to 1.75 and fix new Clippy lints (#1123) Propolis changes: PHD: convert to async (#633) PHD: assume specialized Windows images (#636) propolis-standalone-config needn't be a crate standalone: Use tar for snapshot/restore phd: use latest "lab-2.0-opte" target, not a specific version (#637) PHD: add tests for migration of running processes (#623) PHD: fix `cargo xtask phd` tidy not doing anything (#630) PHD: add documentation for `cargo xtask phd` (#629) standalone: improve virtual device creation errors (#632) phd: add Windows Server 2019 guest adapter (#627) PHD: add `cargo xtask phd` to make using PHD nicer (#619)
Crucible changes: Remove unused fields in IOop (#1149) New downstairs clone subcommand. (#1129) Simplify the do_work_task loop (#1150) Move `Guest` stuff into a module (#1125) Bump nix to 0.27.1 and use new safer Fd APIs (#1110) Move `FramedWrite` work to a separate task (#1145) Use fewer borrows in ExtentInner API (#1147) Update Rust crate reedline to 0.28.0 (#1141) Update Rust crate tokio to 1.36 (#1143) Update Rust crate slog-bunyan to 2.5.0 (#1139) Update Rust crate rayon to 1.8.1 (#1138) Update Rust crate itertools to 0.12.1 (#1137) Update Rust crate byte-unit to 5.1.4 (#1136) Update Rust crate base64 to 0.21.7 (#1135) Update Rust crate async-trait to 0.1.77 (#1134) Discard deferred msgs (#1131) Minor Downstairs cleanup (#1127) Update test_fail_live_repair to support pstop (#1128) Ignore client messages after stopping the IO task (#1126) Move client IO task into a struct (#1124) Bump Rust to 1.75 and fix new Clippy lints (#1123) Propolis changes: PHD: convert to async (#633) PHD: assume specialized Windows images (#636) propolis-standalone-config needn't be a crate standalone: Use tar for snapshot/restore phd: use latest "lab-2.0-opte" target, not a specific version (#637) PHD: add tests for migration of running processes (#623) PHD: fix `cargo xtask phd` tidy not doing anything (#630) PHD: add documentation for `cargo xtask phd` (#629) standalone: improve virtual device creation errors (#632) phd: add Windows Server 2019 guest adapter (#627) PHD: add `cargo xtask phd` to make using PHD nicer (#619) Co-authored-by: Alan Hanson <alan@oxide.computer>
Fixes #1115 , hopefully.
Previously
I used the following script:
After < 10 iterations, I'd typically see an SMF failure, either due to a missing property, or due to some issue on import.
After discussion with @bnaecker and @rmustacc , we confirmed that the Propolis Server zone automatically imports the manifest, and that it may be racing with an explicit import.
This PR
This PR performs the following actions:
svccfg: Pattern 'svc:/system/illumos/propolis- server' doesn't match any instances or services
), but succeeds once the built-inconfigd
service has finished importing the propolis manifest.The aforementioned "create an instance, stop it, destroy it" script has been running on my machine for nearly 100 iterations without a single failure using this PR, which is a major improvement.