
Conversation

@iliana (Contributor) commented Mar 11, 2025

Closes #7281 (Sled Agent: PUT /omicron-zones needs a way to specify TUF artifact for zone image).

@iliana (Contributor Author) commented Mar 11, 2025

This is a draft for now. I've been able to test this manually by crafting a new input to omicron_config_put to change the image source from install_dataset to artifact with the relevant hash; the zone gets torn down and rebuilt with the correct image path:

zonecfg:oxz_cockroachdb_2e4a6e55-3bc6-4ec1-ac53-9ad2c646047d> export
create -b
set zonepath=/pool/ext/d05bafeb-90c9-4a92-8254-6d93252a1c17/crypt/zone/oxz_cockroachdb_2e4a6e55-3bc6-4ec1-ac53-9ad2c646047d
set brand=omicron1
set autoboot=false
set limitpriv=default,dtrace_proc,dtrace_user
set ip-type=exclusive
add net
set physical="oxControlService29"
end
add dataset
set name="oxp_d05bafeb-90c9-4a92-8254-6d93252a1c17/crypt/cockroachdb"
end
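For context, the image source selection on each zone in the config is an enum along these lines. This is only a sketch: InstallDataset and Artifact are the two variants discussed in this PR, but the hash field name and exact shape of the real shared sled-agent type are assumptions here.

// Sketch of the zone image source selection; field names may differ from the real type.
type ArtifactHash = String; // stand-in for the real SHA-256 hash type

#[allow(dead_code)]
enum OmicronZoneImageSource {
    // Default: boot the zone from the image in the install dataset.
    InstallDataset,
    // New in this PR: boot the zone from an image in the artifact store,
    // addressed by its hash.
    Artifact { hash: ArtifactHash },
}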

I don't think I can test this end-to-end until there's also code which can store the zone image source in the database.

I also want to add code to prevent Sled Agent from accepting a zone config that refers to artifacts it doesn't have, per #7281 (comment).

@iliana marked this pull request as ready for review April 7, 2025 19:23
@iliana (Contributor Author) commented Apr 7, 2025

This adds checks in both directions:

  1. Sled agent will refuse to apply a new zone configuration if the artifacts are not present on the local system yet.
  2. Sled agent will refuse to accept a new artifact configuration if it intends to delete any artifacts used by the current zone configuration.

I'm not sure if there's a good approach for doing unit or integration testing here; this might need to be part of an end-to-end test when the other parts are in place.
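A minimal sketch of the shape of the two checks above, using plain hash sets and hypothetical helper names (the real code works against the zone ledger and the artifact store configuration):

use std::collections::BTreeSet;

type ArtifactHash = [u8; 32]; // stand-in for the real hash type

// (1) Reject a new zone config if it references artifacts the store doesn't hold yet.
fn zone_config_ok(
    zone_artifacts: &BTreeSet<ArtifactHash>, // hashes referenced by the incoming zone config
    store: &BTreeSet<ArtifactHash>,          // hashes currently present in the artifact store
) -> bool {
    zone_artifacts.is_subset(store)
}

// (2) Reject a new artifact config if it would drop artifacts that the
// currently-ledgered zone config still references.
fn artifact_config_ok(
    in_use: &BTreeSet<ArtifactHash>,    // hashes referenced by the current zone config
    new_store: &BTreeSet<ArtifactHash>, // hashes the incoming artifact config would keep
) -> bool {
    in_use.is_subset(new_store)
}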

});
}
}
}
Contributor

This looks like a fine place for this check. A question for a "what if incoming configs were applied asynchronously via a reconciler task within sled-agent": would it be better to (a) refuse to accept and ledger the new config if the artifact isn't found (as this does), or (b) accept and ledger the new config but report errors on the specific zone(s) that can't be started? I think I still prefer (a), since if you don't have the artifact there's no guarantee you'd ever get it, but I think (b) would be just as easy to implement if there's a reason to want it.

Contributor Author

Yeah, I think I also still prefer (a), although I'm curious about possible failure modes if the sled reboots.

Thinking out loud: if we accept the config, the zone will never come up on sled boot; if we reject it, we'll still run the old config. I assume that (and thus (a)) is usually the better scenario?

Contributor

Thinking out loud: if we accept the config, the zone will never come up on sled boot; if we reject it, we'll still run the old config.

Correct.

Ok, here's a different kind of failure mode that maybe favors (b): what if the new config contains multiple new zones to start, and only one of them has a missing source? (a) would reject the config in its entirety, so we can't start any new zones until all of their sources exist.

Contributor Author

I'm not sure that necessarily favors (b); we'd still be unable to bring the zone up on boot. (Also, if it's replacing a zone, we'd shut down the old zone and then fail to bring up the new one, so we'd be down that zone.)

I think I'd be fine with (b) in a world where Nexus is actually taking this error into account, but if it is then maybe we should still reject the whole config and let Nexus decide how we should proceed?

Contributor Author

In last week's update sync we agreed to keep the current implementation, particularly given that Sled Agent's current behavior is to tear down zones and then fail to bring up new zones if there's a problem.

Contributor

This looks like a fine place for this check.

I'm second-guessing this; I think at this point we're not holding the lock on self.inner.zones, right? Could we have a sequence like this:

  1. New OmicronZonesConfig comes in; we have artifacts for all its zones, so we pass this check. This config adds a new zone (call it Z).
  2. New artifact store config comes in; we call omicron_zones_list() which takes the zones lock and checks the ledger. The new artifact store config is missing the artifact for zone Z, but we haven't yet applied the incoming config, so this check passes and we write the new artifact store config.
  3. We resume handling the new config from step 1; we go into ensure_all_omicron_zones_persistent which acquires the zone lock, tries to start the new zones, which may now fail because the store is missing the artifact we need for zone Z.

Contributor Author

Hm, yeah. In theory it's simple enough to move the check to after the lock is taken, but I will need to move the artifact store from being owned by SledAgentInner to being owned by ServiceManagerInner, which probably makes more sense anyway.

Contributor Author

Or I could pass in a reference to the artifact store to that function...

Contributor

I'd do whatever is most expedient, I think? I'll have to rework this somewhat soon anyway, since the way the zone ledger is written will change.

zone.image_source.artifact_hash()
})
.collect::<BTreeSet<_>>();
for sha256 in difference {
Contributor

I think this is fine but I had a couple thoughts reading it:

  1. I think this assumes something else guarantees that the artifact store ledger always contains artifacts for every zone in the zone ledger, right? If somehow we had a zone that wasn't in the ledger, we wouldn't catch it being missing from new_config here.
  2. In light of that, do we need to compute difference at all? Or could we check in_use against new_config directly?

Contributor Author

Yes, this would only catch artifacts removed from one generation of the config to the next. It does seem to make more sense to reject any configurations that are missing in-use artifacts, probably with a clearer error message.
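A sketch of what checking in_use directly against the new config could look like, with an error naming each missing artifact (the names here are illustrative, not the actual sled-agent code):

use std::collections::BTreeSet;

type ArtifactHash = [u8; 32]; // stand-in for the real hash type

fn hex(hash: &ArtifactHash) -> String {
    hash.iter().map(|b| format!("{b:02x}")).collect()
}

// Reject the incoming artifact config if any artifact referenced by the current
// zone config is absent from it, reporting every missing artifact rather than
// only the first.
fn validate_artifact_config(
    in_use: &BTreeSet<ArtifactHash>,
    new_config: &BTreeSet<ArtifactHash>,
) -> Result<(), Vec<String>> {
    let missing: Vec<String> = in_use
        .iter()
        .filter(|hash| !new_config.contains(*hash))
        .map(|hash| format!("in-use artifact {} is not present in the new config", hex(hash)))
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}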

zone_image_paths
}
OmicronZoneImageSource::Artifact { .. } => {
// Search both artifact datasets, but look on the boot disk first.
Contributor

Is it important to look on the boot disk first if we're looking an artifact up by hash?

Contributor Author

No, I don't think it's important. I think the InstallDataset match arm made me consider whether it'd be better to prefer the boot disk while still being able to fall back, and it seemed like the least bad option to me:

  • If we prefer the boot disk, the disk was likely functioning at some point recently. If that disk doesn't have the artifact we still fall back to the other disk.
  • If we pick the first disk, we will consistently get the same disk as sorted alphabetically by vendor/model/serial, based on the BTreeMap<DiskIdentity, ManagedDisk> used to keep track of disks. If that disk happens to be faulty we may be in for a bad time. (We are probably in for a bad time if either M.2 is faulty anyway though.)
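Roughly, the resulting search order looks like this; the inputs are made up for the sketch, and the real code derives the dataset paths from the managed-disk map mentioned above:

use std::path::PathBuf;

// Build the list of artifact datasets to search for a zone image, preferring
// the boot M.2's dataset and falling back to the other disk(s).
fn artifact_search_paths(
    boot_disk_dataset: Option<PathBuf>,  // hypothetical: artifact dataset on the boot disk, if known
    other_disk_datasets: Vec<PathBuf>,   // hypothetical: artifact datasets on the remaining disk(s)
) -> Vec<PathBuf> {
    let mut paths = Vec::new();
    // Boot disk first: it was functioning recently enough to boot from, and if
    // the artifact is missing there we still fall back to the other disk.
    if let Some(boot) = boot_disk_dataset {
        paths.push(boot);
    }
    paths.extend(other_disk_datasets);
    paths
}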

@iliana (Contributor Author) commented Apr 23, 2025

I moved the check for artifacts on a new zone configuration to after the lock is taken, to avoid the described sequence. Checking for artifacts currently requires the ArtifactStore, which lives on the SledAgent but not the ServicesManager; however, ServicesManager has a StorageHandle, and checking for an artifact really only requires the storage handle, so it was most expedient to add an associated method that does the same thing without needing a reference to the ArtifactStore.

I also changed the check for new artifact configurations to always reject if an in-use artifact is missing, and changed the error to contain the list of all missing artifacts along with their reasons. (I assume this list of checks will be expanded on later.)
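A rough sketch of the shape of that associated method, assuming a simplified StorageHandle that only exposes the artifact dataset mountpoints (every name here is illustrative, not the actual sled-agent API):

use std::collections::BTreeSet;
use std::path::PathBuf;

type ArtifactHash = [u8; 32]; // stand-in for the real hash type

fn hex(hash: &ArtifactHash) -> String {
    hash.iter().map(|b| format!("{b:02x}")).collect()
}

// Simplified stand-in for the real StorageHandle; all this check needs from it
// is where the artifact datasets are mounted.
struct StorageHandle {
    artifact_datasets: Vec<PathBuf>,
}

// Verify that every artifact referenced by the incoming zone config exists in
// at least one artifact dataset. Intended to be called while the zone lock is
// held, so the artifact store can't change underneath the config being applied.
fn check_artifacts_present(
    storage: &StorageHandle,
    requested: &BTreeSet<ArtifactHash>,
) -> Result<(), Vec<String>> {
    let missing: Vec<String> = requested
        .iter()
        .filter(|hash| {
            !storage
                .artifact_datasets
                .iter()
                .any(|dir| dir.join(hex(hash)).exists())
        })
        .map(|hash| format!("artifact {} not found in any artifact dataset", hex(hash)))
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}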

@iliana merged commit b2467e3 into main Apr 25, 2025
16 checks passed
@iliana deleted the iliana/welcome-to-artifact-zone branch April 25, 2025 02:27