Make RSS resistant to intermingled deployments

TL;DR: Two separate Omicron deployments on the same network can see each other's prefixes via ddm, and each can "try to induct the other" into its own view of the rack.

As mentioned in https://github.com/oxidecomputer/omicron/issues/1639, it seems that two separate instantiations of RSS can try to communicate with an overlapping set of sleds, and that may not be desired.

Here's a timeline:
- Sled 1 starts RSS, is configured to boot with a rack secret threshold of "1". Sled agent comes online.
- Sled 2 starts RSS, is configured to boot with a rack secret threshold of "1".
- Sled 2 sees Sled 1, advertised by ddmd. Sled 2's RSS creates a plan that includes both Sled 1 and Sled 2 in its view of the "rack".
- Sled 2's RSS sends a "start sled agent" request to Sled 1. This request fails, because Sled 1 has already booted its sled agent by itself. (This was a very real error, seen [here](https://buildomat.eng.oxide.computer/wg/0/artefact/01GAPV78Y25JQ6N8KDWFD3PMC8/CgkkE4LN5tpJogucAvQFe9PW37gM1p0A72O9M7FIXWaS7j1Y/01GAPV7T0ENAHBRNHD3N0KZJ2V/01GAPX7X3NAF77W14DBY1N018Z/system-illumos-sled-agent:default.log?format=x-bunyan#L38); thank you @iliana for noticing!)

Some ideas for mitigating:
- [ ] As part of the bootstrap protocol, advertise whether or not the sled agent is already running. Ignore these already-running sleds when configuring RSS. This is potentially racy (which machine gets picked up by which RSS?), but it handles this case more explicitly.
- [ ] Make RSS generate a "rack UUID" - which we're doing for Nexus, anyway - that gets transmitted during the bootstrapping phases. This way, sleds can explicitly flag the error as "I'm already part of rack `foo`, but you're asking me to join rack `bar`".
- [ ] Help a human make the right call - in a non-automated, production environment, we'll be presenting a list of sleds to an operator for confirmation. Perhaps we could figure out a way to use location information when doing this presentation - like, "you have two sleds that claim to be in the physical slot 1, what do you want to do about that?"
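
To make the first two ideas concrete, here is a minimal sketch of what the checks might look like. Everything in it is hypothetical: `RackId`, `SledState`, `JoinError`, `handle_join`, and `eligible_sleds` are illustrative names, not existing Omicron types or APIs, and a plain `u128` stands in for a real UUID type. It also assumes that a repeated join request for the same rack should succeed, which is a design choice rather than something decided here.

```rust
use std::fmt;

/// Hypothetical rack identifier. A real implementation would likely
/// use a proper UUID type; a u128 keeps this sketch dependency-free.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RackId(pub u128);

/// Hypothetical state a sled could advertise during bootstrap.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SledState {
    /// The sled agent has not been started; the sled can join a rack.
    Uninitialized,
    /// The sled agent is already running as part of this rack.
    Initialized(RackId),
}

/// Hypothetical error returned when a sled refuses to join a rack.
#[derive(Debug, PartialEq, Eq)]
pub enum JoinError {
    AlreadyInRack { current: RackId, requested: RackId },
}

impl fmt::Display for JoinError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            JoinError::AlreadyInRack { current, requested } => write!(
                f,
                "already part of rack {:x}, but asked to join rack {:x}",
                current.0, requested.0
            ),
        }
    }
}

/// Sled side: handle a "start sled agent" request that carries a rack ID.
/// Assumption: re-sending the request for the same rack is treated as a no-op.
pub fn handle_join(state: &mut SledState, requested: RackId) -> Result<(), JoinError> {
    let current_state = *state;
    match current_state {
        SledState::Uninitialized => {
            *state = SledState::Initialized(requested);
            Ok(())
        }
        SledState::Initialized(current) if current == requested => Ok(()),
        SledState::Initialized(current) => {
            Err(JoinError::AlreadyInRack { current, requested })
        }
    }
}

/// RSS side: given (sled id, advertised state) pairs, keep only sleds that
/// are uninitialized or already belong to our rack.
pub fn eligible_sleds(ours: RackId, advertised: &[(u32, SledState)]) -> Vec<u32> {
    advertised
        .iter()
        .filter(|(_, state)| match state {
            SledState::Uninitialized => true,
            SledState::Initialized(rack) => *rack == ours,
        })
        .map(|(id, _)| *id)
        .collect()
}
```

Under this sketch, the timeline above would end with Sled 2's RSS either leaving Sled 1 out of its plan (via `eligible_sleds`) or receiving an explicit `AlreadyInRack` error instead of an opaque "start sled agent" failure. The filter alone does not remove the race noted in the first idea: two RSS instances can still both see the same uninitialized sled, so the sled-side check is what decides which request wins.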

Make RSS resistant to intermingled deployments #1640
