Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

Bootstore

The bootstore is the crate that implements the Oxide trust quorum protocol. The bootstore provides the necessary key share storage required to unlock the rack secret via the trust quorum protocol and boot the rack.

We plan to incrementally improve the bootstore to be more full featured and secure over time. Each implementation is referred to as a Scheme. Schemes are each described by a zero-sized struct that can be debug printed as necessary. You can see a the scheme description, represented as V0Scheme, for trust quorum version 0 in ./src/schemes/v0/mod.rs.

The bootstore also provides a mechanism for replicating and storing early networking configuration required to bring up the rest of the control plane. This is implemented in the Node type, at a layer above the LRTQ FSM.

Rack Secret Generation and Share distribution

For the time being the generation of the RackSecret, its splitting into key shares, and its distribution packages for use by the trust quorum scheme V0, all live under the ./src/trust_quorum directory. We anticipate that the RackSecret generation algorithms will remain the same across all schemes for the foreseeable future. This will not be the case for the packages that distribute key shares and metadata to different sleds, and so we version these.

The rack secret is used as input key material to the ../key-manager so that disks can be decrypted across sleds to allow the rack to boot. This allows the schemes to distribute key shares and rotate and reconstruct the rack secret to evolve and become more secure over time. It also enables the code using the rack secret to be kept consistent as long as it is fed the same shape of input key material.

Scheme V0

Only Scheme V0 of the bootstore trust quorum protocol is implemented so far. The protocol itself is fully implemented and tested via generative, property based tests. The protocol code is structured as a state machine that mutates state and returns messages to send via its user API. It also informs the caller when the state must be persisted. Importantly, all code for the scheme v0 protocol is deterministic, apart from UUID generation, and performs no I/O. This allows us to write property based tests that simulate an entire cluster of sleds operating in concert and ensure that any test failures shrink correctly.

Threat Model and Security Goals

The trust quorum v0 scheme is limited in capability by the fact that a lot of the lower level software required for remote attestation and trusted transport is not yet complete. Our threat model is quite weak as a result and will be described in terms of what an attacker can and cannot do to break our system.

Our primary security goal of the v0 scheme is to prevent an attacker from gaining useful information off of U.2 drives stolen in a "casual" manner. By casual manner, we mean that an attacker walks up to a rack and pulls out a few U.2 drives from one or more sleds or steals 1 or 2 sleds in their entirety including U.2 and M.2 storage devices. An attacker takes these home, but is unable to decrypt the U.2 drives to recover useful information.

We meet this goal by having the M.2 devices on each sled store individual, and unique, key shares that can be used to reconstruct a rack secret when K shares are combined from the devices on K independent sleds. K is fixed at N/2 + 1 of the initial members of the rack cluster during setup time such that if there are 16 sleds in the rack, 9 M.2 devices from different sleds will have to be stolen to reconstruct the secret. The online trust quorum protocol operating over the bootstrap network serves to retrieve shares and reconstruct the rack secret, and then the KeyManger uses the rack secret as input key material to derive individual disk encryption keys for the U.2 drives.

Importantly, because we operate the online protocol over unencrypted TCP channels, an attacker on the bootstrap network may passively sniff data and recover the rack secret. Additionally, because sleds cannot securely attest to each other yet, an active attacker can participate in the protocol and recover the rack secret at will as long as they are on the bootstrap network. There is no authentication mechanism required to participate in the v0 scheme.

To re-iterate, our protocol only protects against offline attacks by an attacker with limited access to stolen hardware. The online protocol is quite insecure as a result of urgency. One may ask then: why deliver this subsystem at all? There are quite a few reasons for the v0 scheme. The primary reason is that we know that we want to encrypt control plane and customer data on U.2 drives in the future via ZFS encryption. We also do not want to force a complete wipe of an existing customer’s implementation, even if it is just test data for a prototype project. Since ZFS does not allow encrypting datasets after the fact and we setup our zpools and datsets at rack setup time, we must enable encryption and provide a key for each dataset on day 1. The question then becomes where do we get the input key material from. We essentially had 4 options:

  1. Hardcode it - this is what is currently committed as the HardcodedSecretRetriever

  2. Store random values constructed at RSS time on the M.2s and use them locally

  3. Derive keys from VPD data unique to a sled

  4. Build a simplified trust quorum scheme over untrusted channels on the bootstrap network

Option 1 is the absolute bare minimum to enable ZFS encryption. Options 2 and 3 don’t really protect from casual physical attacks as M.2 drives and entire sleds are almost as easily stolen as U.2 drives, and VPD data is not secret. We therefore opted to build out the v0 scheme to satisfy our casual access offline attack model. It’s important to note, that this threat model for offline access is identical to the full-fledged trust quorum as currently designed. It’s just that the online portion is woefully insecure. However, by building out version 0 early on we can gain experience operating in the shared secret model while protecting against some attacks. We do not anticipate staying in the v0 scheme for long. We will upgrade out of this scheme and rotate our rack secret and encryption keys to become more secure over time.

State Machine Flow

Here we get into the meat of what currently exists. The higher level bootstore software, such as that that implements the network and persistence layers and provides an async API for use by the bootstrap agent drives the protocol through the v0 Fsm API methods in ./src/schemes/v0/fsm.rs. There are not many methods, and each serves a purpose for use by software to be written later. The overall flow will look something like the following at initial rack setup time:

  1. Bootstrap agent uses DDM to learn addresses of peers.

  2. The bootstore establishes a timeout per Tick and calls the Fsm::tick callback on every timer tick.

  3. Bootstrap agent informs the bootstore about peers so TCP connections can be established on the bootstrap network

  4. Bootstrap agent informs the higher level bootstore API on the same sled as RSS to initialize the rack with initial membership chosen by the wicket user.

  5. The bootstore calls Fsm::init_rack and gets back a Result<Option<ApiOutput>, ApiError>

  6. The output informs the bootstore to persist the local state to the M.2s and send the messages to peers addressed by their Baseboard in Envelopes. These envelopes are are retrieved from the Fsm via a call to Fsm::drain_envelopes.

  7. Peers receive messages over TCP channels and respond according to the protocol

  8. The rack is initialized

  9. The bootstrap agent storage manager on each peer asks the KeyManager to unlock its disks, which results in a call to the RackSecretKeyRetriever which eventually triggers a call to Fsm::load_rack_secret, which returns another Output containing messages since the shares need to be retrieved from peers.

  10. Messages make their way back to the bootstore and it handles them with Fsm::handle, and this again returns a Result, and the ability to call Fsm::drain_envelopes.

  11. Eventually, the Result returns an ApiOutput::RackSecret which contains the rack secret and the higher level request from the KeyManager can be satisifed.

The above is a relatively happy path, and in some cases peers can get disconnected, timeouts can occur, and messages can be retransmitted as needed on reconnection. In worse cases, rack initialization may timeout. In this case RSS reset must be used to clear all the M.2 drives and try again. To prevent problems with a lingering sled that does not come back online to be cleared, that sled should be pulled. However, if it remains in the rack it will have a different rack_uuid and so errors will be triggered when it tries to talk to other sleds in the new, successfully initialized trust quorum.

Later on, we may wish to add a sled to a rack, most likely to replace a failed one. In order to do this the bootstrap agent will have to be instructed to trigger a call to the bootstore to inform the new sled that it is a "learning" member. This will result in a call to Fsm::init_learner, which will trigger a flow of messages that allow the new sled to learn a key share from one of the other initial members. Each of the initial members keeps a set of encrypted "extra" key shares for this purpose that can be decrypted via a key that is also derived from the rack secret. The flow looks something like this:

sequenceDiagram
    title Sled Addition
    participant L as Learner
    participant Im1 as InitialMember_1
    participant Im2 as InitialMember_2
    L->>Im1: Learn
    Im1->>Im2: GetShare
    Im2->>Im1: Share
    Note over Im1: Decrypt extra shares
    Im1->>L: KeyShare
Loading

Learners will rotate through known peers until they find one that has a share.

Testing strategy

The primary method of testing is generative testing via proptest. There are two property based tests: one for running as the rack_coordinator and one for running as a learner. Once the initial setup is performed to either initialize the rack, or learn a LearnedSharePkg, the tests largely share the same behavior in terms of processing generated `Action`s.

One important thing to note is in regard to message responses. We always send responses to a request from the system under test (SUT) peer when a peer is connected according to the TestState.

It is almost certainly useful for any reader/reviewer to go ahead and run the proptests. It is also sometimes helpful to call println with the generated test input to understand better how the pieces fit together. It’s important to note that each proptest run will run multiple instances of our test. So while our test is currently configured to generate up to 1000 actions, there can be dozes of tests run. It’s also useful to modify code or the model to try to break the test in certain ways and then watch it shrink to give you the failing history.

HAVE FUN!