Description
When sled-agent receives the set of zones it should be running, it attempts to remove any zones that are currently running and are not part of the new set, but failures to remove such zones are only logged and do not affect the result of the API request:
omicron/sled-agent/src/services.rs
Lines 3547 to 3550 in 13d411f
// Attempts to take a zone bundle and remove a zone.
//
// Logs, but does not return an error on failure.
async fn zone_bundle_and_try_remove(
Additionally, sled-agent only attempts to remove a zone exactly once. The first thing zone_bundle_and_try_remove does is remove the zone from the in-memory list:
omicron/sled-agent/src/services.rs
Lines 3557 to 3564 in 13d411f
let Some(mut zone) = existing_zones.remove(&expected_zone_name) else {
    warn!(
        log,
        "Expected to remove zone, but could not find it";
        "zone_name" => &expected_zone_name,
    );
    return;
};
It then tries to shut down the zone, logs any errors, and (later, after trying to start any new zones) records that in-memory list back into the ledger. That means we have this flow, where even if Nexus resends the PUT, sled-agent won't retry the removal:
- sled-agent is currently on omicron-zones generation N
- Nexus sends a PUT with generation N + 1 that removes one zone
- sled-agent tries to shut down that one zone, but proceeds regardless of success
- sled-agent records that it is now on generation N + 1
- Nexus sends another PUT request with generation N + 1; sled-agent thinks it's already there, so does nothing
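The flow above can be modeled with a small, self-contained sketch. Everything here is hypothetical and heavily simplified: `SledAgent`, `put_zones_lenient`, and the `stuck_zones` hook are illustrative names, not the real sled-agent types, and starting new zones is omitted. The treatment of generations older than the ledger's is also an assumption; the issue only describes the equal case.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for sled-agent's zone state.
#[derive(Debug, Default)]
struct SledAgent {
    /// Generation recorded in the ledger.
    ledger_generation: u64,
    /// Zones the ledger (and in-memory list) claims are running.
    ledger_zones: BTreeSet<String>,
    /// Zones that are actually running on the sled.
    running_zones: BTreeSet<String>,
    /// Zones whose shutdown fails (a hook that exists only in this sketch).
    stuck_zones: BTreeSet<String>,
}

impl SledAgent {
    /// Models the behavior described above: removal failures are logged
    /// and ignored, and the ledger advances regardless.
    fn put_zones_lenient(&mut self, generation: u64, wanted: &BTreeSet<String>) {
        // Assumption: a request at or below the current generation is a no-op.
        if generation <= self.ledger_generation {
            return;
        }
        let to_remove: Vec<String> =
            self.ledger_zones.difference(wanted).cloned().collect();
        for zone in to_remove {
            // The zone leaves the in-memory list before shutdown is attempted,
            // so a failed shutdown is never retried.
            self.ledger_zones.remove(&zone);
            if self.stuck_zones.contains(&zone) {
                eprintln!("failed to remove zone {zone}");
            } else {
                self.running_zones.remove(&zone);
            }
        }
        // The ledger is updated even if a removal failed.
        self.ledger_generation = generation;
    }
}
```

In this model, a zone listed in `stuck_zones` stays in `running_zones` after the first PUT, and a second PUT with the same generation returns early without touching it.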
@sunshowers and I were chatting in the context of reconfigurator / zone cleanup, and we think it may be important to fail this request if any zone removals fail. A motivating example is removing an expunged Nexus and reconfigurator deciding whether any sagas assigned to that Nexus have been reassigned. Reconfigurator needs to check two things:
1. Is there a guarantee that no new saga assignments will be claimed by the expunged Nexus?
2. Have any sagas assigned to the expunged Nexus been reassigned?
Critically: seeing that there are no sagas assigned to the expunged Nexus (item 2) alone is insufficient, since it's inherently racy if new assignments could still be made. It must first ensure item 1.
The working plan for reconfigurator is that it will decide "has a zone actually been expunged" by inspecting the regularly-polled inventory. sled-agent reports its current omicron-zones config to inventory, but based on the above, this might be a lie if zone removal failed. "Zone expunged and inventory reports zone is gone" should be enough to satisfy item 1 above for Nexus and saga assignment, but isn't if sled-agent lies: a Nexus that sled-agent claims isn't running, but that actually is, could continue to claim new sagas.
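To make the ordering of the two checks concrete, here is a minimal sketch of how such a decision might be structured. The types and names (`Observed`, `SagaStatus`, `saga_status`) are hypothetical and are not reconfigurator's actual API; the sketch only shows that the saga count is consulted after the "zone is really gone" condition holds.

```rust
use std::collections::BTreeSet;

/// Hypothetical snapshot of what a planner can observe for one sled.
struct Observed {
    /// Zones marked as expunged.
    expunged: BTreeSet<String>,
    /// Zones that inventory reports as still in sled-agent's config.
    inventory_zones: BTreeSet<String>,
    /// Sagas still assigned to the Nexus zone being examined.
    sagas_assigned: usize,
}

#[derive(Debug, PartialEq)]
enum SagaStatus {
    /// Item 1 does not hold yet: the zone may still claim new sagas.
    MayStillClaim,
    /// Item 1 holds, but item 2 does not: sagas remain to be reassigned.
    AwaitingReassignment,
    /// Both items hold.
    Done,
}

fn saga_status(obs: &Observed, nexus_zone: &str) -> SagaStatus {
    // Item 1: expunged AND inventory no longer reports the zone.
    // This is only sound if inventory reflects what is actually running.
    let gone = obs.expunged.contains(nexus_zone)
        && !obs.inventory_zones.contains(nexus_zone);
    if !gone {
        // A saga count of zero proves nothing here, since it is racy.
        return SagaStatus::MayStillClaim;
    }
    // Item 2: only meaningful once item 1 is established.
    if obs.sagas_assigned > 0 {
        SagaStatus::AwaitingReassignment
    } else {
        SagaStatus::Done
    }
}
```

Note that the sketch inherits the problem described above: if `inventory_zones` omits a zone that is still running, `saga_status` can return `Done` while that Nexus is still able to claim sagas.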
The sled-agent behavior was clearly intentional, and I haven't tried to dig into the history to figure out why. Can we change it to fail the PUT (and, critically, not update its ledger) if zone removal fails, instead of only logging?
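As a rough illustration of the requested behavior, the sketch below fails the request and leaves the ledger untouched when any removal fails, so a resent PUT retries the removal. All names (`Ledger`, `PutError`, `put_zones_strict`, the `try_remove` callback) are hypothetical, and the sketch assumes that removing an already-removed zone is harmless, since zones removed successfully in a failed request will be attempted again on retry.

```rust
use std::collections::BTreeSet;

/// Hypothetical error returned to the caller of the PUT.
#[derive(Debug, PartialEq)]
enum PutError {
    ZoneRemovalFailed(Vec<String>),
}

/// Hypothetical, simplified ledger contents.
#[derive(Debug, Default)]
struct Ledger {
    generation: u64,
    zones: BTreeSet<String>,
}

/// `try_remove` stands in for the real zone shutdown logic.
fn put_zones_strict<F>(
    ledger: &mut Ledger,
    generation: u64,
    wanted: &BTreeSet<String>,
    mut try_remove: F,
) -> Result<(), PutError>
where
    F: FnMut(&str) -> Result<(), String>,
{
    // Already at this generation: nothing to do.
    if generation == ledger.generation {
        return Ok(());
    }
    let mut failed = Vec::new();
    for zone in ledger.zones.difference(wanted) {
        if let Err(err) = try_remove(zone.as_str()) {
            // Still logged, but now also reported to the caller.
            eprintln!("failed to remove zone {zone}: {err}");
            failed.push(zone.clone());
        }
    }
    if !failed.is_empty() {
        // Critically, the ledger is NOT updated on failure.
        return Err(PutError::ZoneRemovalFailed(failed));
    }
    ledger.zones = wanted.clone();
    ledger.generation = generation;
    Ok(())
}
```

With this shape, a failed removal leaves the ledger at generation N, so inventory keeps reporting the zone and a second PUT with generation N + 1 is processed again rather than skipped.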