
Conversation

@iliana (Contributor) commented Mar 11, 2025

Closes #7281 (Sled Agent: PUT /omicron-zones needs a way to specify TUF artifact for zone image).

@iliana (Contributor Author) commented Mar 11, 2025

This is a draft for now. I've been able to test this manually by crafting a new input to omicron_config_put to change the image source from install_dataset to artifact with the relevant hash; the zone gets torn down and rebuilt with the correct image path:

zonecfg:oxz_cockroachdb_2e4a6e55-3bc6-4ec1-ac53-9ad2c646047d> export
create -b
set zonepath=/pool/ext/d05bafeb-90c9-4a92-8254-6d93252a1c17/crypt/zone/oxz_cockroachdb_2e4a6e55-3bc6-4ec1-ac53-9ad2c646047d
set brand=omicron1
set autoboot=false
set limitpriv=default,dtrace_proc,dtrace_user
set ip-type=exclusive
add net
set physical="oxControlService29"
end
add dataset
set name="oxp_d05bafeb-90c9-4a92-8254-6d93252a1c17/crypt/cockroachdb"
end
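For context, the image source selection on each zone in the config is an enum along these lines. This is only a sketch: InstallDataset and Artifact are the two variants discussed in this PR, but the hash field name and exact shape of the real shared sled-agent type are assumptions here.

// Sketch of the zone image source selection; field names may differ from the real type.
type ArtifactHash = String; // stand-in for the real SHA-256 hash type

#[allow(dead_code)]
enum OmicronZoneImageSource {
    // Default: boot the zone from the image in the install dataset.
    InstallDataset,
    // New in this PR: boot the zone from an image in the artifact store,
    // addressed by its hash.
    Artifact { hash: ArtifactHash },
}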

I don't think I can test this end-to-end until there's also code which can store the zone image source in the database.

I also want to add code to prevent Sled Agent from accepting a zone config that refers to artifacts it doesn't have, per #7281 (comment).

@iliana marked this pull request as ready for review April 7, 2025 19:23
@iliana (Contributor Author) commented Apr 7, 2025

This adds checks in both directions:

  1. Sled agent will refuse to apply a new zone configuration if the artifacts are not present on the local system yet.
  2. Sled agent will refuse to accept a new artifact configuration if it intends to delete any artifacts used by the current zone configuration.

I'm not sure if there's a good approach for doing unit or integration testing here; this might need to be part of an end-to-end test when the other parts are in place.
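A minimal sketch of the shape of the two checks above, using plain hash sets and hypothetical helper names (the real code works against the zone ledger and the artifact store configuration):

use std::collections::BTreeSet;

type ArtifactHash = [u8; 32]; // stand-in for the real hash type

// (1) Reject a new zone config if it references artifacts the store doesn't hold yet.
fn zone_config_ok(
    zone_artifacts: &BTreeSet<ArtifactHash>, // hashes referenced by the incoming zone config
    store: &BTreeSet<ArtifactHash>,          // hashes currently present in the artifact store
) -> bool {
    zone_artifacts.is_subset(store)
}

// (2) Reject a new artifact config if it would drop artifacts that the
// currently-ledgered zone config still references.
fn artifact_config_ok(
    in_use: &BTreeSet<ArtifactHash>,    // hashes referenced by the current zone config
    new_store: &BTreeSet<ArtifactHash>, // hashes the incoming artifact config would keep
) -> bool {
    in_use.is_subset(new_store)
}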

});
}
}
}
Contributor

This looks like a fine place for this check. A question for a "what if incoming configs were applied asynchronously via a reconciler task within sled-agent": would it be better to (a) refuse to accept and ledger the new config if the artifact isn't found (as this does), or (b) accept and ledger the new config but report errors on the specific zone(s) that can't be started? I think I still prefer (a), since if you don't have the artifact there's no guarantee you'd ever get it, but I think (b) would be just as easy to implement if there's a reason to want it.

Contributor Author

Yeah, I think I also still prefer (a), although I'm curious about possible failure modes if the sled reboots.

Thinking out loud: if we accept the config, the zone will never come up on sled boot; if we reject it, we'll still run the old config. I assume that (and thus (a)) is usually the better scenario?

Contributor

Thinking out loud: if we accept the config, the zone will never come up on sled boot; if we reject it, we'll still run the old config.

Correct.

Ok, here's a different kind of failure mode that maybe favors (b): what if the new config contains multiple new zones to start, and only one of them has a missing source? (a) would reject the config in its entirety, so we can't start any new zones until all of their sources exist.

Contributor Author

I'm not sure that necessarily favors (b); we'd still be unable to bring the zone up on boot. (Also, if it's replacing a zone, we'd shut down the old zone and then fail to bring up the new one, so we'd be down that zone.)

I think I'd be fine with (b) in a world where Nexus is actually taking this error into account, but if it is then maybe we should still reject the whole config and let Nexus decide how we should proceed?

Contributor Author

In last week's update sync we agreed to keep the current implementation, particularly given that Sled Agent's current behavior is to tear down zones and then fail to bring up new zones if there's a problem.

Contributor

This looks like a fine place for this check.

I'm second-guessing this; I think at this point we're not holding the lock on self.inner.zones, right? Could we have a sequence like this:

  1. New OmicronZonesConfig comes in; we have artifacts for all its zones, so we pass this check. This config adds a new zone (call it Z).
  2. New artifact store config comes in; we call omicron_zones_list() which takes the zones lock and checks the ledger. The new artifact store config is missing the artifact for zone Z, but we haven't yet applied the incoming config, so this check passes and we write the new artifact store config.
  3. We resume handling the new config from step 1; we go into ensure_all_omicron_zones_persistent which acquires the zone lock, tries to start the new zones, which may now fail because the store is missing the artifact we need for zone Z.

Contributor Author

Hm, yeah. In theory it's simple enough to move the check to after the lock is taken, but I will need to move the artifact store from being owned by SledAgentInner to being owned by ServiceManagerInner, which probably makes more sense anyway.

Contributor Author

Or I could pass in a reference to the artifact store to that function...

Contributor

I'd do whatever is most expedient, I think? I'll have to rework this somewhat soon anyway, since the way the zone ledger is written will change.

zone.image_source.artifact_hash()
})
.collect::<BTreeSet<_>>();
for sha256 in difference {
Contributor

I think this is fine but I had a couple thoughts reading it:

  1. I think this assumes something else guarantees that the artifact store ledger always contains artifacts for every zone in the zone ledger, right? If somehow we had a zone that wasn't in the ledger, we wouldn't catch it being missing from new_config here.
  2. In light of that, do we need to compute difference at all? Or could we check in_use against new_config directly?

Contributor Author

Yes, this would only catch artifacts removed from one generation of the config to the next. It does seem to make more sense to reject any configurations that are missing in-use artifacts, probably with a clearer error message.
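A sketch of what checking in_use directly against the new config could look like, with an error naming each missing artifact (the names here are illustrative, not the actual sled-agent code):

use std::collections::BTreeSet;

type ArtifactHash = [u8; 32]; // stand-in for the real hash type

fn hex(hash: &ArtifactHash) -> String {
    hash.iter().map(|b| format!("{b:02x}")).collect()
}

// Reject the incoming artifact config if any artifact referenced by the current
// zone config is absent from it, reporting every missing artifact rather than
// only the first.
fn validate_artifact_config(
    in_use: &BTreeSet<ArtifactHash>,
    new_config: &BTreeSet<ArtifactHash>,
) -> Result<(), Vec<String>> {
    let missing: Vec<String> = in_use
        .iter()
        .filter(|hash| !new_config.contains(*hash))
        .map(|hash| format!("in-use artifact {} is not present in the new config", hex(hash)))
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}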

zone_image_paths
}
OmicronZoneImageSource::Artifact { .. } => {
// Search both artifact datasets, but look on the boot disk first.
Contributor

Is it important to look on the boot disk first if we're looking an artifact up by hash?

Contributor Author

No, I don't think it's important. I think the InstallDataset match arm made me consider whether it'd be better to prefer the boot disk while still being able to fall back, and it seemed like the least bad option to me:

  • If we prefer the boot disk, the disk was likely functioning at some point recently. If that disk doesn't have the artifact we still fall back to the other disk.
  • If we pick the first disk, we will consistently get the same disk as sorted alphabetically by vendor/model/serial, based on the BTreeMap<DiskIdentity, ManagedDisk> used to keep track of disks. If that disk happens to be faulty we may be in for a bad time. (We are probably in for a bad time if either M.2 is faulty anyway though.)
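Roughly, the resulting search order looks like this; the inputs are made up for the sketch, and the real code derives the dataset paths from the managed-disk map mentioned above:

use std::path::PathBuf;

// Build the list of artifact datasets to search for a zone image, preferring
// the boot M.2's dataset and falling back to the other disk(s).
fn artifact_search_paths(
    boot_disk_dataset: Option<PathBuf>,  // hypothetical: artifact dataset on the boot disk, if known
    other_disk_datasets: Vec<PathBuf>,   // hypothetical: artifact datasets on the remaining disk(s)
) -> Vec<PathBuf> {
    let mut paths = Vec::new();
    // Boot disk first: it was functioning recently enough to boot from, and if
    // the artifact is missing there we still fall back to the other disk.
    if let Some(boot) = boot_disk_dataset {
        paths.push(boot);
    }
    paths.extend(other_disk_datasets);
    paths
}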

@iliana (Contributor Author) commented Apr 23, 2025

I moved the check for artifacts on a new zone configuration to after the lock is taken, to avoid the described sequence. Checking for artifacts currently requires the ArtifactStore, which lives on the SledAgent but not the ServicesManager; however, ServicesManager has a StorageHandle, and checking for an artifact really only requires the storage handle, so it was most expedient to add an associated method that does the same thing without needing a reference to the ArtifactStore.

I also changed the check for new artifact configurations to always reject if an in-use artifact is missing, and changed the error to contain the list of all missing artifacts along with their reasons. (I assume this list of checks will be expanded on later.)
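A rough sketch of the shape of that associated method, assuming a simplified StorageHandle that only exposes the artifact dataset mountpoints (every name here is illustrative, not the actual sled-agent API):

use std::collections::BTreeSet;
use std::path::PathBuf;

type ArtifactHash = [u8; 32]; // stand-in for the real hash type

fn hex(hash: &ArtifactHash) -> String {
    hash.iter().map(|b| format!("{b:02x}")).collect()
}

// Simplified stand-in for the real StorageHandle; all this check needs from it
// is where the artifact datasets are mounted.
struct StorageHandle {
    artifact_datasets: Vec<PathBuf>,
}

// Verify that every artifact referenced by the incoming zone config exists in
// at least one artifact dataset. Intended to be called while the zone lock is
// held, so the artifact store can't change underneath the config being applied.
fn check_artifacts_present(
    storage: &StorageHandle,
    requested: &BTreeSet<ArtifactHash>,
) -> Result<(), Vec<String>> {
    let missing: Vec<String> = requested
        .iter()
        .filter(|hash| {
            !storage
                .artifact_datasets
                .iter()
                .any(|dir| dir.join(hex(hash)).exists())
        })
        .map(|hash| format!("artifact {} not found in any artifact dataset", hex(hash)))
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}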

@iliana merged commit b2467e3 into main Apr 25, 2025
16 checks passed
@iliana deleted the iliana/welcome-to-artifact-zone branch April 25, 2025 02:27