sled agent: implement OmicronZoneImageSource::Artifact
#7781
Conversation
This is a draft for now. I've been able to test this manually by crafting a new input to … I don't think I can test this end-to-end until there's also code which can store the zone image source in the database. I also want to add code to prevent Sled Agent from accepting a zone config referring to artifacts it doesn't have, per #7281 (comment).
This adds checks in both directions: Sled Agent rejects a new zone config that references artifacts the artifact store doesn't have, and the artifact store rejects a new artifact config that would remove artifacts still referenced by the ledgered zone config.
I'm not sure if there's a good approach for doing unit or integration testing here; this might need to be part of an end-to-end test when the other parts are in place.
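For illustration, here is a minimal sketch of the first direction, using hypothetical stand-in types (`ArtifactHash`, `ImageSource`, `ZoneConfig`) rather than the real sled-agent ones: collect every zone whose image source names an artifact hash the store doesn't have, and reject the config with the full list.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified stand-ins for the real sled-agent types.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ArtifactHash([u8; 32]);

enum ImageSource {
    InstallDataset,
    Artifact { hash: ArtifactHash },
}

struct ZoneConfig {
    name: String,
    image_source: ImageSource,
}

/// Direction 1 (sketch): reject a zone config that references artifacts the
/// store doesn't have, returning every missing (zone, hash) pair rather than
/// failing on the first one.
fn check_zone_artifacts(
    zones: &[ZoneConfig],
    available: &BTreeSet<ArtifactHash>,
) -> Result<(), Vec<(String, ArtifactHash)>> {
    let missing: Vec<_> = zones
        .iter()
        .filter_map(|zone| match &zone.image_source {
            ImageSource::Artifact { hash } if !available.contains(hash) => {
                Some((zone.name.clone(), *hash))
            }
            _ => None,
        })
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}
```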
sled-agent/src/sled_agent.rs (Outdated)
This looks like a fine place for this check. A question for a "what if incoming configs were applied asynchronously via a reconciler task within sled-agent": would it be better to (a) refuse to accept and ledger the new config if the artifact isn't found (as this does), or (b) accept and ledger the new config but report errors on the specific zone(s) that can't be started? I think I still prefer (a), since if you don't have the artifact there's no guarantee you'd ever get it, but I think (b) would be just as easy to implement if there's a reason to want it.
Yeah, I think I also still prefer (a), although I'm curious about possible failure modes if the sled reboots.
Thinking out loud: if we accept the config, the zone will never come up on sled boot; if we reject it, we'll still run the old config. I assume that (and thus (a)) is usually the better scenario?
> Thinking out loud: if we accept the config, the zone will never come up on sled boot; if we reject it, we'll still run the old config.
Correct.
Ok, here's a different kind of failure mode that maybe favors (b): what if the new config contains multiple new zones to start, and only one of them has a missing source? (a) would reject the config in its entirety, so we can't start any new zones until all of their sources exist.
I'm not sure that necessarily favors (b); we'd still be unable to bring the zone up on boot. (Also if it's replacing a zone, we'd shut down the old zone and then fail to bring up the new one, so we'd be down that zone.)
I think I'd be fine with (b) in a world where Nexus is actually taking this error into account, but if it is, then maybe we should still reject the whole config and let Nexus decide how we should proceed?
In last week's update sync we agreed to keep the current implementation, particularly given that Sled Agent's current behavior is to tear down zones and then fail to bring up new zones if there's a problem.
> This looks like a fine place for this check.
I'm second-guessing this; I think at this point we're not holding the lock on `self.inner.zones`, right? Could we have a sequence like this:

1. A new `OmicronZonesConfig` comes in; we have artifacts for all its zones, so we pass this check. This config adds a new zone (call it Z).
2. A new artifact store config comes in; we call `omicron_zones_list()`, which takes the zones lock and checks the ledger. The new artifact store config is missing the artifact for zone Z, but we haven't yet applied the incoming config, so this check passes and we write the new artifact store config.
3. We resume handling the new config from step 1; we go into `ensure_all_omicron_zones_persistent`, which acquires the zone lock and tries to start the new zones, which may now fail because the store is missing the artifact we need for zone Z.
Hm, yeah. In theory it's simple enough to move the check to after the lock is taken, but I'll need to move the artifact store from being owned by `SledAgentInner` to instead being owned by `ServiceManagerInner`, which probably makes more sense anyway.
Or I could pass a reference to the artifact store into that function...
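As a rough illustration of that alternative (hypothetical, simplified types and names, not the actual sled-agent code): acquire the zones lock first, then validate the incoming config against a borrowed artifact store, so the check and the ledger/apply step happen under the same lock and the interleaving described above can't occur.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

// Hypothetical, simplified stand-ins; the real types live in sled-agent.
struct ArtifactStore {
    available: BTreeSet<String>, // artifact hashes, hex-encoded here
}

struct ZonesConfig {
    artifact_hashes: Vec<String>,
}

struct ZoneState {
    current: Option<ZonesConfig>,
}

// Take the zones lock *first*, then validate the incoming config against a
// reference to the artifact store, so a concurrent artifact-config change
// can't slip in between the check and applying the new zone config.
fn ensure_zones(
    zones: &Mutex<ZoneState>,
    store: &ArtifactStore,
    new_config: ZonesConfig,
) -> Result<(), Vec<String>> {
    let mut state = zones.lock().unwrap();
    let missing: Vec<String> = new_config
        .artifact_hashes
        .iter()
        .filter(|hash| !store.available.contains(*hash))
        .cloned()
        .collect();
    if !missing.is_empty() {
        return Err(missing);
    }
    // Only after the check passes, and while still holding the lock, do we
    // ledger and apply the new config.
    state.current = Some(new_config);
    Ok(())
}
```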
I'd do whatever is most expedient, I think? I'll have to rework this a bit soon anyway, since the way the zone ledger is written will change.
sled-agent/src/artifact_store.rs (Outdated)
        zone.image_source.artifact_hash()
    })
    .collect::<BTreeSet<_>>();
for sha256 in difference {
I think this is fine, but I had a couple of thoughts reading it:
- I think this assumes something else guarantees that the artifact store ledger always contains artifacts for every zone in the zone ledger, right? If somehow we had a zone that wasn't in the ledger, we wouldn't catch it being missing from `new_config` here.
- In light of that, do we need to compute `difference` at all? Or could we check `in_use` against `new_config` directly?
Yes, this would only catch artifacts removed from one generation of the config to the next. It does seem to make more sense to reject any configurations that are missing in-use artifacts, probably with a clearer error message.
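A minimal sketch of that stricter check, with hypothetical simplified types (hashes as hex strings rather than the real artifact-hash type): compare the in-use hashes directly against the incoming artifact config and reject if anything referenced by the zone ledger would go missing.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified sketch: reject a new artifact-store config that is
// missing any artifact currently referenced by ledgered zones, rather than
// only diffing the previous artifact config against the new one.
fn check_artifact_config(
    in_use: &BTreeSet<String>,     // hashes referenced by the zone ledger
    new_config: &BTreeSet<String>, // hashes the new artifact config keeps
) -> Result<(), Vec<String>> {
    let missing: Vec<String> = in_use
        .iter()
        .filter(|hash| !new_config.contains(*hash))
        .cloned()
        .collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(missing)
    }
}
```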
    zone_image_paths
}
OmicronZoneImageSource::Artifact { .. } => {
    // Search both artifact datasets, but look on the boot disk first.
Is it important to look on the boot disk first if we're looking an artifact up by hash?
No, I don't think it's important. I think the `InstallDataset` match arm made me consider whether it'd be better to prefer the boot disk while still being able to fall back, and it seemed like the least bad option to me:
- If we prefer the boot disk, the disk was likely functioning at some point recently. If that disk doesn't have the artifact, we still fall back to the other disk.
- If we pick the first disk, we will consistently get the same disk, sorted alphabetically by vendor/model/serial, based on the `BTreeMap<DiskIdentity, ManagedDisk>` used to keep track of disks. If that disk happens to be faulty we may be in for a bad time. (We are probably in for a bad time if either M.2 is faulty anyway, though.)
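Sketching that preference with hypothetical names (the real lookup goes through the managed-disk machinery rather than bare filesystem paths): probe the boot disk's artifact dataset first and fall back to the other M.2 only if the artifact isn't there.

```rust
use std::path::PathBuf;

// Hypothetical sketch of "boot disk first, then fall back": probe the
// artifact dataset on the boot M.2 before the other one, returning the
// first candidate path that actually exists.
fn find_artifact(
    boot_disk_dataset: PathBuf,
    other_disk_dataset: Option<PathBuf>,
    sha256_hex: &str,
) -> Option<PathBuf> {
    std::iter::once(boot_disk_dataset)
        .chain(other_disk_dataset)
        .map(|dataset| dataset.join(sha256_hex))
        .find(|path| path.exists())
}
```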
I moved the check for artifacts on a new zone configuration to after the lock is taken, to avoid the described sequence. Currently checking for artifacts requires the … Also changed the check for new artifact configurations to always reject if an in-use artifact is missing. I also changed the error to contain the list of all missing artifacts along with their reasons (I assume this list of checks will be expanded later).
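As an illustration of the error shape described here (hypothetical, simplified types; the real error lives in sled-agent), one way to report every unavailable artifact together with a reason rather than failing on the first one:

```rust
use std::collections::BTreeMap;
use std::fmt;

// Hypothetical sketch of an error that carries the full list of missing
// artifacts and why each one is considered missing.
#[derive(Debug)]
enum MissingReason {
    NotInArtifactStore,
    RemovedButInUseByZone { zone: String },
}

#[derive(Debug)]
struct MissingArtifacts {
    // Keyed by artifact hash (hex-encoded here for simplicity).
    missing: BTreeMap<String, MissingReason>,
}

impl fmt::Display for MissingArtifacts {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "{} artifact(s) unavailable:", self.missing.len())?;
        for (hash, reason) in &self.missing {
            writeln!(f, "  {hash}: {reason:?}")?;
        }
        Ok(())
    }
}
```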
Closes #7281.