add CLI for executing blueprints by hand #7801
(edit: I've updated this PR and the description above to reflect the changes described here)
While commenting on the safety bits below, I realized a different way to phrase this tool that could be much safer. Right now, it's phrased as "execute whatever blueprint you want". It could instead be phrased as: "execute the system's current target blueprint, but read it from this file instead of the database". Or even "execute a specific blueprint that was at least at one time a target for this system, but read it from this file instead of the database".
This would still support the use cases I mentioned above:
- for development, it would be fine if you had to use reconfigurator-cli to make your custom blueprint, load it into Nexus, make it the target, and then use this tool
- in the production case I mentioned, we only wanted to execute the system's existing target blueprint anyway
- you could also use this to test execution of older blueprints, which is always supposed to be safe (and you could more easily test that with this tool)
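The narrower phrasing above boils down to a membership check before execution. Here is a minimal sketch of that idea, not the actual implementation: `BlueprintId`, `Allowed`, and `check_executable` are hypothetical names, and a plain integer stands in for the real typed UUIDs.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a blueprint ID (the real system uses typed UUIDs).
pub type BlueprintId = u128;

/// Why a blueprint read from a file was accepted for execution.
#[derive(Debug, PartialEq, Eq)]
pub enum Allowed {
    /// The file holds the system's current target blueprint.
    CurrentTarget,
    /// The file holds a blueprint that was a target at some earlier point.
    PastTarget,
}

/// Hypothetical safety check: only allow executing a blueprint that is, or
/// once was, the system's target. Anything else could fork the history.
pub fn check_executable(
    from_file: BlueprintId,
    current_target: BlueprintId,
    past_targets: &BTreeSet<BlueprintId>,
) -> Result<Allowed, String> {
    if from_file == current_target {
        Ok(Allowed::CurrentTarget)
    } else if past_targets.contains(&from_file) {
        Ok(Allowed::PastTarget)
    } else {
        Err(format!(
            "blueprint {from_file} was never a target; refusing to execute"
        ))
    }
}
```

With this shape, a custom development blueprint has to be loaded into Nexus and made the target before the check lets it through.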
Generally this should be pretty safe because you (mostly) can't fork the linear history. That is, you can't execute a blueprint that creates a generation N on a sled agent when Nexus doesn't know about that generation: Nexus does know about generation N. But it's still a little dicey in that the contents of generation N on the sled might diverge from the contents that Nexus wants to send it (because this tool and Nexus are interpreting that generation differently). That would produce errors during execution. This would get resolved if Nexus ever had to bump the generation for some other reason. The specific behavior and impact would depend on how the blueprint differed between Nexus and the tool.
At the very least, this seems like a big improvement.
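To make the remaining risk concrete, here is a toy model of the generation divergence described in this comment. It is a sketch under an assumption I have not checked against the sled agent code: that a sled accepts a strictly newer generation, treats an identical config at the same generation as a no-op, and rejects a different config at the same generation. `ZoneConfig`, `Sled`, and `PutResult` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Hypothetical model of the zone config a sled holds at some generation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ZoneConfig {
    pub generation: u64,
    pub contents: String,
}

/// Outcome of sending a config to the modeled sled.
#[derive(Debug, PartialEq, Eq)]
pub enum PutResult {
    /// Newer generation: the sled takes it.
    Applied,
    /// Same generation, same contents: nothing to do.
    Unchanged,
    /// Same generation, different contents: an execution error for the sender.
    Conflict,
    /// Older generation than the sled already has.
    Stale,
}

pub struct Sled {
    current: ZoneConfig,
}

impl Sled {
    pub fn new(initial: ZoneConfig) -> Self {
        Sled { current: initial }
    }

    /// Assumed acceptance rule (see lead-in); not taken from the real code.
    pub fn put(&mut self, new: ZoneConfig) -> PutResult {
        match new.generation.cmp(&self.current.generation) {
            Ordering::Greater => {
                self.current = new;
                PutResult::Applied
            }
            Ordering::Less => PutResult::Stale,
            Ordering::Equal if new == self.current => PutResult::Unchanged,
            Ordering::Equal => PutResult::Conflict,
        }
    }
}
```

In this model, if the tool sends generation 6 first, Nexus's differing generation 6 hits `Conflict` until something makes Nexus send generation 7, which is `Applied` and ends the divergence.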
pub nexus_id: Option<OmicronZoneUuid>,
pub creator: OmicronZoneUuid,
Previously, we were using nexus_id in three places:
- as the "creator" field for DNS records
- to assign ourselves sagas from any expunged Nexus instances
- to assign ourselves support bundles from any expunged Nexus instances
I've separated these out. Now we use creator for the first one. This must always be specified. We use nexus_id for the other two. You can leave this unspecified, in which case execution will skip these steps (because it makes no sense for this tool to assign itself sagas or support bundles).
I think creator could be a free-form string, but I wasn't sure whether some code might assume it's a UUID, so I did what felt like the conservative thing and kept it a UUID.
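As a rough illustration of the split described above (a sketch only; the real execution code is structured differently): `creator` is always required, while a missing `nexus_id` causes the Nexus-only steps to be skipped. `ZoneId`, `ExecParams`, and `plan_steps` are hypothetical names, and `ZoneId` stands in for `OmicronZoneUuid` so the example needs no external crates.

```rust
/// Hypothetical stand-in for OmicronZoneUuid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ZoneId(pub u128);

/// Assumed shape of the parameters, mirroring the two fields in the diff.
pub struct ExecParams {
    /// Always required: recorded as the "creator" of DNS records.
    pub creator: ZoneId,
    /// Set only when running as Nexus; None when running as the CLI tool.
    pub nexus_id: Option<ZoneId>,
}

/// Lists each modeled step with a flag saying whether it would run.
/// Steps that only make sense for a real Nexus are skipped without a nexus_id.
pub fn plan_steps(params: &ExecParams) -> Vec<(&'static str, bool)> {
    let as_nexus = params.nexus_id.is_some();
    vec![
        ("write DNS records", true),
        ("reassign sagas from expunged Nexus instances", as_nexus),
        ("reassign support bundles from expunged Nexus instances", as_nexus),
    ]
}
```

Calling `plan_steps` with `nexus_id: None` marks the two reassignment steps as skipped, which matches the "not running as Nexus" skips noted in the test output later in this thread.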
On the latest commit (aa4cca5), I retested the easy cases.
Test case: a blueprint that was previously a target:
Test case: a blueprint that was never a target:
Finally, the current target:
Note the two steps skipped because "not running as Nexus".
I still plan to retest:
Live tests on an a4x2 built with this change (live tests from #7823):
I re-did the same thing I did at Friday's demo, using this tool to change the image of a currently-running pantry zone.
Then I generated a blueprint from the current target that points the pantry zone on g0 at its image from this TUF repo:
Here's the state of that zone at this point (up a few hours, running bits from the install dataset):
Make the change:
Good. Now for real:
After:
That's the right digest for the new image. Great! Note this is all on an a4x2 with both this PR and #7281.
jgallagher left a comment
LGTM - 👍 on the binary name change
This adds reconfigurator-exec: a command line tool that can execute ~~a specific blueprint from a reconfigurator save file (output by reconfigurator-cli or omdb reconfigurator export)~~ the system's current target blueprint, as read from a saved blueprint file rather than from the database.
The real reason I made this is to be able to work on / test / demo parts of blueprint execution where we haven't implemented the database serialization yet for that part of the blueprint. That sounds cheesy but it's useful for two different things today (using an artifact for a zone image source and doing RoT updates using #7741). For the latter, the in-memory representation is evolving a fair bit as I work on execution and it would be quite a lot slower (and pointless) if I had to keep the database part in sync in the meantime.
~~This tool is very dangerous because if you use it to execute a blueprint that is neither the system's current target nor a previous one, you could fork what's supposed to be a linear history. Concretely: if this deploys a blueprint that moves some sled's Omicron zones from generation 5 to generation 6, and someone in Nexus also generates a blueprint going from 5 to 6, but it's a different generation 6, that'd be very bad. All kinds of problems are possible. Many of them are fixable but it's still super dangerous if applied to a system you care about. I'd welcome any suggestions for safeties here.~~
This tool is a little dangerous because it can deploy something to, say, sled agents that's a little different than what Nexus is deploying for the same blueprint. One of these will "win" on a per-sled basis. The other will see blueprint execution errors. This would generally get resolved if ever Nexus had to bump the associated generation for some other reason. The specific behavior and impact would depend on how the blueprint differed between Nexus and this tool.
~~This is really intended for the development use case above, though we were recently discussing a case where this could be useful in production, which is that if a Scrimlet fails and then the rack cold-starts: currently, Nexus won't come up because it will find two Dendrites in DNS but won't be able to reach one. In this case, if you used omdb to expunge the sled and then used this tool to re-execute the current blueprint (which Nexus can't do because it's down, but this tool should be able to because the database is up), it should allow Nexus to come back up.~~
This tool is really intended for the development use case above. There may be some production use cases it could help with, but those are theoretical (the one crossed-out above does not work because expunge requires Nexus to be running). And this tool is dangerous enough that I don't think we should ship this tool right now. If you want to use it, you have to build it yourself and copy it to the system you want to use it on.
A few things that are safer than they seem about this: