The bootstore is the crate that implements the Oxide trust quorum protocol. The bootstore provides the necessary key share storage required to unlock the rack secret via the trust quorum protocol and boot the rack.
We plan to incrementally improve the bootstore to be more full featured and
secure over time. Each implementation is referred to as a Scheme. Schemes are
each described by a zero-sized struct that can be debug printed as necessary.
You can see a the scheme description, represented as V0Scheme, for trust
quorum version 0 in ./src/schemes/v0/mod.rs.
The bootstore also provides a mechanism for replicating and storing early
networking configuration required to bring up the rest of the control plane.
This is implemented in the Node type, at a layer above the LRTQ FSM.
For the time being the generation of the RackSecret, its splitting into key
shares, and its distribution packages for use by the trust quorum scheme V0,
all live under the ./src/trust_quorum directory. We anticipate that
the RackSecret generation algorithms will remain the same across all schemes
for the foreseeable future. This will not be the case for the packages that
distribute key shares and metadata to different sleds, and so we version these.
The rack secret is used as input key material to the ../key-manager so that disks can be decrypted across sleds to allow the rack to boot. This allows the schemes to distribute key shares and rotate and reconstruct the rack secret to evolve and become more secure over time. It also enables the code using the rack secret to be kept consistent as long as it is fed the same shape of input key material.
Only Scheme V0 of the bootstore trust quorum protocol is implemented so far. The protocol itself is fully implemented and tested via generative, property based tests. The protocol code is structured as a state machine that mutates state and returns messages to send via its user API. It also informs the caller when the state must be persisted. Importantly, all code for the scheme v0 protocol is deterministic, apart from UUID generation, and performs no I/O. This allows us to write property based tests that simulate an entire cluster of sleds operating in concert and ensure that any test failures shrink correctly.
The trust quorum v0 scheme is limited in capability by the fact that a lot of the lower level software required for remote attestation and trusted transport is not yet complete. Our threat model is quite weak as a result and will be described in terms of what an attacker can and cannot do to break our system.
Our primary security goal of the v0 scheme is to prevent an attacker from gaining useful information off of U.2 drives stolen in a "casual" manner. By casual manner, we mean that an attacker walks up to a rack and pulls out a few U.2 drives from one or more sleds or steals 1 or 2 sleds in their entirety including U.2 and M.2 storage devices. An attacker takes these home, but is unable to decrypt the U.2 drives to recover useful information.
We meet this goal by having the M.2 devices on each sled store individual, and
unique, key shares that can be used to reconstruct a rack secret when K shares
are combined from the devices on K independent sleds. K is fixed at N/2
+ 1 of the initial members of the rack cluster during setup time such that if
there are 16 sleds in the rack, 9 M.2 devices from different sleds will have to
be stolen to reconstruct the secret. The online trust quorum protocol operating
over the bootstrap network serves to retrieve shares and reconstruct the rack
secret, and then the KeyManger uses the rack secret as input key material to
derive individual disk encryption keys for the U.2 drives.
Importantly, because we operate the online protocol over unencrypted TCP channels, an attacker on the bootstrap network may passively sniff data and recover the rack secret. Additionally, because sleds cannot securely attest to each other yet, an active attacker can participate in the protocol and recover the rack secret at will as long as they are on the bootstrap network. There is no authentication mechanism required to participate in the v0 scheme.
To re-iterate, our protocol only protects against offline attacks by an attacker with limited access to stolen hardware. The online protocol is quite insecure as a result of urgency. One may ask then: why deliver this subsystem at all? There are quite a few reasons for the v0 scheme. The primary reason is that we know that we want to encrypt control plane and customer data on U.2 drives in the future via ZFS encryption. We also do not want to force a complete wipe of an existing customer’s implementation, even if it is just test data for a prototype project. Since ZFS does not allow encrypting datasets after the fact and we setup our zpools and datsets at rack setup time, we must enable encryption and provide a key for each dataset on day 1. The question then becomes where do we get the input key material from. We essentially had 4 options:
-
Hardcode it - this is what is currently committed as the
HardcodedSecretRetriever -
Store random values constructed at RSS time on the M.2s and use them locally
-
Derive keys from VPD data unique to a sled
-
Build a simplified trust quorum scheme over untrusted channels on the bootstrap network
Option 1 is the absolute bare minimum to enable ZFS encryption. Options 2 and 3 don’t really protect from casual physical attacks as M.2 drives and entire sleds are almost as easily stolen as U.2 drives, and VPD data is not secret. We therefore opted to build out the v0 scheme to satisfy our casual access offline attack model. It’s important to note, that this threat model for offline access is identical to the full-fledged trust quorum as currently designed. It’s just that the online portion is woefully insecure. However, by building out version 0 early on we can gain experience operating in the shared secret model while protecting against some attacks. We do not anticipate staying in the v0 scheme for long. We will upgrade out of this scheme and rotate our rack secret and encryption keys to become more secure over time.
Here we get into the meat of what currently exists. The higher level bootstore
software, such as that that implements the network and persistence layers and
provides an async API for use by the bootstrap agent drives the protocol through
the v0 Fsm API methods in ./src/schemes/v0/fsm.rs. There are not many
methods, and each serves a purpose for use by software to be written later. The
overall flow will look something like the following at initial rack setup time:
-
Bootstrap agent uses DDM to learn addresses of peers.
-
The bootstore establishes a timeout per
Tickand calls theFsm::tickcallback on every timer tick. -
Bootstrap agent informs the bootstore about peers so TCP connections can be established on the bootstrap network
-
Bootstrap agent informs the higher level bootstore API on the same sled as RSS to initialize the rack with initial membership chosen by the wicket user.
-
The bootstore calls
Fsm::init_rackand gets back aResult<Option<ApiOutput>, ApiError> -
The output informs the bootstore to persist the local state to the M.2s and send the messages to peers addressed by their
BaseboardinEnvelopes. These envelopes are are retrieved from the Fsm via a call toFsm::drain_envelopes. -
Peers receive messages over TCP channels and respond according to the protocol
-
The rack is initialized
-
The bootstrap agent storage manager on each peer asks the
KeyManagerto unlock its disks, which results in a call to theRackSecretKeyRetrieverwhich eventually triggers a call toFsm::load_rack_secret, which returns anotherOutputcontaining messages since the shares need to be retrieved from peers. -
Messages make their way back to the bootstore and it handles them with
Fsm::handle, and this again returns aResult, and the ability to callFsm::drain_envelopes. -
Eventually, the
Resultreturns anApiOutput::RackSecretwhich contains the rack secret and the higher level request from the KeyManager can be satisifed.
The above is a relatively happy path, and in some cases peers can get
disconnected, timeouts can occur, and messages can be retransmitted as needed
on reconnection. In worse cases, rack initialization may timeout. In this case
RSS reset must be used to clear all the M.2 drives and try again. To prevent
problems with a lingering sled that does not come back online to be cleared,
that sled should be pulled. However, if it remains in the rack it will have a
different rack_uuid and so errors will be triggered when it tries to talk to
other sleds in the new, successfully initialized trust quorum.
Later on, we may wish to add a sled to a rack, most likely to replace a failed
one. In order to do this the bootstrap agent will have to be instructed to
trigger a call to the bootstore to inform the new sled that it is a "learning"
member. This will result in a call to Fsm::init_learner, which will trigger
a flow of messages that allow the new sled to learn a key share from one of
the other initial members. Each of the initial members keeps a set of encrypted
"extra" key shares for this purpose that can be decrypted via a key that is also
derived from the rack secret. The flow looks something like this:
sequenceDiagram
title Sled Addition
participant L as Learner
participant Im1 as InitialMember_1
participant Im2 as InitialMember_2
L->>Im1: Learn
Im1->>Im2: GetShare
Im2->>Im1: Share
Note over Im1: Decrypt extra shares
Im1->>L: KeyShare
Learners will rotate through known peers until they find one that has a share.
The primary method of testing is generative testing via
proptest. There are two
property based tests: one for running as the rack_coordinator and one for
running as a learner. Once the initial setup is performed to either initialize
the rack, or learn a LearnedSharePkg, the tests largely share the same
behavior in terms of processing generated `Action`s.
One important thing to note is in regard to message responses. We always send
responses to a request from the system under test (SUT) peer when a peer is
connected according to the TestState.
It is almost certainly useful for any reader/reviewer to go ahead and run the
proptests. It is also sometimes helpful to call println with the generated
test input to understand better how the pieces fit together. It’s important
to note that each proptest run will run multiple instances of our test. So
while our test is currently configured to generate up to 1000 actions, there
can be dozes of tests run. It’s also useful to modify code or the model to
try to break the test in certain ways and then watch it shrink to give you the
failing history.
HAVE FUN!