TL;DR: Two different setups of Omicron, both sharing the same network, can share prefixes via ddm and "try to induct the other" into a view of the rack.
As mentioned in #1639, it seems that two separate instantiations of RSS can try to communicate with an overlapping set of sleds, and that may not be desired.
Here's a timeline:
- Sled 1 starts RSS, is configured to boot with a rack secret threshold of "1". Sled agent comes online.
- Sled 2 starts RSS, is configured to boot with a rack secret threshold of "1".
- Sled 2 sees Sled 1, advertised by ddmd. Sled 2's RSS creates a plan that includes both Sled 1 and Sled 2 in its view of the "rack".
- Sled 2's RSS sends a "start sled agent" request to Sled 1. This request fails, because Sled 1 already booted the sled agent by itself. (this was a very real error, seen here - thank you @iliana for noticing!).
Some ideas for mitigating:
- As part of the bootstrap protocol, advertise whether or not the sled agent is already running. Have RSS ignore these already-running sleds when building its plan. There's the possibility of raciness here (which machine gets picked up by which RSS?) but it makes this case more explicitly handled.
- Make RSS generate a "rack UUID" - which we're doing for Nexus, anyway - that gets transmitted during the bootstrapping phases. This way, sleds can explicitly flag the error as "I'm already part of rack `foo`, but you're asking me to join rack `bar`".
- Help a human make the right call - in a non-automated, production environment, we'll be presenting a list of sleds to an operator for confirmation. Perhaps we could figure out a way to use location information when doing this presentation - like, "you have two sleds that claim to be in the physical slot 1, what do you want to do about that?"
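
To make the first two ideas concrete, here's a rough sketch of what the checks could look like. None of these names exist in Omicron today: `RackId`, `SledStatus`, `JoinError`, `handle_join`, and `candidate_sleds` are all hypothetical, and the rack ID is a plain string here only to keep the example dependency-free (a real version would presumably use a UUID type). Treating a repeated request from the same rack as a no-op is also an assumption on my part.

```rust
use std::fmt;

/// Hypothetical rack identifier (a stand-in for a real UUID type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct RackId(pub String);

/// Hypothetical status a sled could advertise during bootstrapping.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum SledStatus {
    /// The sled agent has not been started; the sled can join a rack.
    Uninitialized,
    /// The sled agent is already running as part of the given rack.
    Initialized(RackId),
}

/// Error a sled could return when asked to join a second rack.
#[derive(Debug, PartialEq, Eq)]
pub enum JoinError {
    AlreadyInRack { current: RackId, requested: RackId },
}

impl fmt::Display for JoinError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            JoinError::AlreadyInRack { current, requested } => write!(
                f,
                "I'm already part of rack {}, but you're asking me to join rack {}",
                current.0, requested.0
            ),
        }
    }
}

/// Sled-side check for a "start sled agent" request tagged with a rack ID.
pub fn handle_join(
    status: &SledStatus,
    requested: &RackId,
) -> Result<SledStatus, JoinError> {
    match status {
        // A fresh sled adopts the rack it is asked to join.
        SledStatus::Uninitialized => Ok(SledStatus::Initialized(requested.clone())),
        // Assumption: a repeated request from the same rack is a no-op.
        SledStatus::Initialized(current) if current == requested => Ok(status.clone()),
        // A request from a different rack is flagged explicitly.
        SledStatus::Initialized(current) => Err(JoinError::AlreadyInRack {
            current: current.clone(),
            requested: requested.clone(),
        }),
    }
}

/// RSS-side filter: only consider sleds that are not already running.
pub fn candidate_sleds<'a>(sleds: &'a [(&'a str, SledStatus)]) -> Vec<&'a str> {
    sleds
        .iter()
        .filter(|(_, status)| matches!(status, SledStatus::Uninitialized))
        .map(|(name, _)| *name)
        .collect()
}
```

With something like this, the timeline above would end differently: Sled 2's RSS would either never put Sled 1 in its plan (the filter), or it would get back a clear "already part of another rack" error instead of a generic failure to start the sled agent. The filter alone still has the race mentioned in the first idea, so the sled-side check is the part that actually catches the conflict.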