Make RSS resistant to intermingled deployments #1640

Open
@smklein

Description

TL;DR: Two different setups of Omicron, both sharing the same network, can share prefixes via ddm and "try to induct the other" into a view of the rack.

As mentioned in #1639, it seems that two separate instantiations of RSS can try to communicate with an overlapping set of sleds, and that may not be desired.

Here's a timeline:

  • Sled 1 starts RSS, is configured to boot with a rack secret threshold of "1". Sled agent comes online.
  • Sled 2 starts RSS, is configured to boot with a rack secret threshold of "1".
  • Sled 2 sees sled 1, advertised by ddmd. This RSS creates a plan that includes both sled 1 and 2 in the view of the "rack".
  • Sled 2's RSS sends a "start sled agent" request to sled 1. This request fails because sled 1 has already started its sled agent on its own. (This was a very real error, seen here. Thank you @iliana for noticing!)
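
To make the failure mode concrete, here is a minimal toy model of the timeline above. Every name in it (`Sled`, `StartError`, `plan_rack`) is hypothetical and invented for illustration; none of it is taken from the Omicron codebase, and the real bootstrap protocol is certainly more involved.

```rust
use std::collections::BTreeSet;

/// Hypothetical sled identifier; not a type from Omicron.
type SledId = u8;

/// Toy model of a sled's bootstrap state.
#[derive(Debug, Default)]
struct Sled {
    sled_agent_running: bool,
}

/// Hypothetical error for a rejected "start sled agent" request.
#[derive(Debug, PartialEq, Eq)]
enum StartError {
    AlreadyRunning,
}

impl Sled {
    /// Models the "start sled agent" request: it fails if the agent is
    /// already up, which is what sled 2's RSS ran into.
    fn start_sled_agent(&mut self) -> Result<(), StartError> {
        if self.sled_agent_running {
            return Err(StartError::AlreadyRunning);
        }
        self.sled_agent_running = true;
        Ok(())
    }
}

/// Models RSS planning as described in the timeline: the local sled plus
/// every peer advertised on the network ends up in the plan, with no check
/// for whether a peer already belongs to another deployment.
fn plan_rack(local: SledId, advertised_peers: &[SledId]) -> BTreeSet<SledId> {
    let mut plan: BTreeSet<SledId> = advertised_peers.iter().copied().collect();
    plan.insert(local);
    plan
}
```

In this model, sled 1 calls `start_sled_agent()` on itself first and succeeds. Sled 2 then builds `plan_rack(2, &[1])`, which contains both sleds, and its follow-up `start_sled_agent()` call against sled 1 returns `Err(StartError::AlreadyRunning)`.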

Some ideas for mitigating:

  • As part of the bootstrap protocol, advertise whether or not the sled agent is already running. Ignore these already running sleds when configuring RSS. There's the possibility of raciness here (which machine gets picked up by which RSS?) but it makes this case more explicitly handled.
  • Make RSS generate a "rack UUID" - which we're doing for Nexus, anyway - that gets transmitted during the bootstrapping phases. This way, sleds can explicitly flag the error as "I'm already part of rack foo, but you're asking me to join rack bar."
  • Help a human make the right call - in a non-automated, production environment, we'll be presenting a list of sleds to an operator for confirmation. Perhaps we could figure out a way to use location information when doing this presentation - like, "you have two sleds that claim to be in the physical slot 1, what do you want to do about that?"
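
The first two ideas above could look roughly like the sketch below. It is only an illustration under stated assumptions: `RackId`, `SledState`, `JoinError`, `Advertisement`, and `candidate_sleds` are all hypothetical names, and a plain `u128` stands in for a real UUID type so the example needs nothing beyond the standard library.

```rust
/// Stand-in for a rack UUID (assumption: a real implementation would
/// likely use a proper UUID type rather than a bare u128).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct RackId(u128);

/// Hypothetical error a sled returns when asked to join a second rack.
#[derive(Debug, PartialEq, Eq)]
enum JoinError {
    AlreadyInRack { current: RackId, requested: RackId },
}

/// Toy per-sled state: which rack, if any, this sled has joined.
#[derive(Debug, Default)]
struct SledState {
    rack: Option<RackId>,
}

impl SledState {
    /// Accepts the first rack ID it sees, tolerates repeats of the same ID,
    /// and explicitly rejects a request to join a different rack.
    fn join_rack(&mut self, requested: RackId) -> Result<(), JoinError> {
        match self.rack {
            Some(current) if current != requested => {
                Err(JoinError::AlreadyInRack { current, requested })
            }
            _ => {
                self.rack = Some(requested);
                Ok(())
            }
        }
    }
}

/// Hypothetical bootstrap advertisement that also reports whether the
/// sled agent is already running.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Advertisement {
    sled: u8,
    sled_agent_running: bool,
}

/// Keeps only sleds whose agent is not yet running, so RSS skips sleds
/// that were already set up elsewhere.
fn candidate_sleds(ads: &[Advertisement]) -> Vec<u8> {
    ads.iter()
        .filter(|ad| !ad.sled_agent_running)
        .map(|ad| ad.sled)
        .collect()
}
```

The filter alone still leaves the race mentioned above, since two RSS instances can both see the same sled as "not running" at the same moment. The rack ID check is what turns that race into an explicit, attributable error on the sled itself.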

Labels: development (bugs, paper cuts, feature requests, or other thoughts on making omicron development better)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions