Bootstore message handler implementation #1687
```rust
conn.immediate_transaction(|tx| {
    // Does the rack_uuid match what's stored?
    self.validate_rack_uuid(tx, rack_uuid)?;
```
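For context, a minimal sketch of what a `validate_rack_uuid` helper might look like inside that transaction, assuming a hypothetical single-row table holding the rack UUID as TEXT (the real schema and error types in this PR will differ):

```rust
use diesel::prelude::*;
use diesel::sqlite::SqliteConnection;
use uuid::Uuid;

// Hypothetical single-row table storing the rack UUID; illustrative only.
diesel::table! {
    rack (uuid) {
        uuid -> Text,
    }
}

#[derive(Debug, thiserror::Error)]
enum Error {
    #[error("rack UUID mismatch: stored {stored}, got {actual}")]
    RackUuidMismatch { stored: String, actual: Uuid },
    #[error(transparent)]
    Db(#[from] diesel::result::Error),
}

fn validate_rack_uuid(tx: &mut SqliteConnection, rack_uuid: &Uuid) -> Result<(), Error> {
    // Load the UUID written at rack initialization and compare it to the
    // one carried by the incoming message.
    let stored: String = rack::table.select(rack::uuid).first(tx)?;
    if stored != rack_uuid.to_string() {
        return Err(Error::RackUuidMismatch { stored, actual: *rack_uuid });
    }
    Ok(())
}
```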
I know I wrote in the PR description that we should not accept a Commit if there is even a prepare for a later epoch. We don't do that here. I'm not actually sure if we should. For instance, is it possible that a node was prepared, offline for the commit, then somehow got a later prepare that included itself? That seems unlikely and maybe we can prevent it at the nexus layer. However, if we don't prevent it, we may still need the commit to allow the sled to unlock itself while a new prepare is in progress. I need to think about this some more.
I'm leaning toward actually allowing this. I'm not sure it's harmful. I will continue to think about it as this protocol evolves.
LGTM 👍
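To make the trade-off concrete, the behavior being leaned toward here might look roughly like this (a hypothetical sketch, not the PR's actual handler; all names are illustrative):

```rust
// Hypothetical decision logic for handling a KeyShareCommit, sketching the
// "allow commit of an older prepare while a newer prepare is in flight"
// behavior discussed above.
fn should_accept_commit(
    committed_epoch: Option<i32>, // highest epoch we have committed, if any
    prepared_epochs: &[i32],      // epochs for which we hold a prepare
    commit_epoch: i32,            // epoch this commit refers to
) -> bool {
    // We must actually hold a prepare for the epoch being committed.
    if !prepared_epochs.contains(&commit_epoch) {
        return false;
    }
    // Never regress: reject commits at or below the committed epoch.
    if let Some(committed) = committed_epoch {
        if commit_epoch <= committed {
            return false;
        }
    }
    // Deliberately do *not* reject the commit just because a prepare for a
    // later epoch exists; that prepare may still fail, and this commit may
    // be what lets the sled unlock in the meantime.
    true
}
```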
```diff
-    &mut self,
+    &self,
     conn: &mut SqliteConnection,
     rack_uuid: &Uuid,
     epoch: i32,
```
Can epochs be negative?
NOOOOOOOOOO. However, the type of INTEGER in sqlite is an i32, and even BIGINT is just i64. Rather than forcing lossy conversion I kept the types the same. I'm definitely open to changing this although it shouldn't be much of a problem here since we only start with epoch 0 and reject updates for earlier epochs.
I'm tempted to suggest making epoch a `u32` in all the public APIs of this crate; the fact that SQLite forces us into `i32` is bleeding out here. But if that's painful I don't think it's a huge deal given the checks.
I'll think about it :)
I'm not actually sure it's better to use a u32 given the constraints. I'm gonna leave it for now just to get this merged, but will keep it in mind.
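If the public API did move to an unsigned epoch, one way to avoid lossy conversion at the SQLite boundary is to store epochs as `i64` (SQLite integers are signed 64-bit) and convert at the edges; a sketch under that assumption, not what this PR does:

```rust
// Sketch: expose u32 epochs publicly while storing i64 in SQLite.
// Every u32 fits losslessly in i64, so writing is infallible; the
// fallible direction is reading back a value someone else wrote.
fn epoch_to_db(epoch: u32) -> i64 {
    i64::from(epoch)
}

fn epoch_from_db(raw: i64) -> Result<u32, String> {
    u32::try_from(raw).map_err(|_| format!("corrupt epoch in DB: {raw}"))
}
```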
This is the second PR of the `bootstore` implementation. It creates handler methods for all existing messages and their corresponding DB requirements. RFD 238 section 5 will need to be updated to reflect this flow.

The way the flow is currently supposed to work is that a coordinator (coming in a follow-up PR) sends the following messages to initialize the trust quorum:

- `Initialize`
- `KeyShareCommit` (for epoch 0)

`Initialize` can be sent multiple times until the `KeyShareCommit` for epoch 0 locks in the configuration and rack UUID. This allows the initial coordinator (RSS) to die, and even have its local storage or entire sled die, and still allow us to make progress via a new RSS. Importantly, once wired through the system, the coordinator will negotiate (via sled-agent and Nexus) to write the `KeyShareCommit` to CockroachDB before sending it to all nodes. This ensures that rack initialization will complete and that CockroachDB only has to be set up once. There are some tricky failure modes here, but none of that is in this PR, so we can move on. The important point is that once a `KeyShareCommit` for epoch 0 succeeds, the rack is initialized.
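For a rough mental model of the message set described here, a hypothetical sketch of the request types (field names and payloads are illustrative, not the crate's actual definitions):

```rust
use uuid::Uuid;

// Hypothetical shapes for the bootstore messages discussed in this PR;
// the real definitions live in the `bootstore` crate and will differ.
enum Request {
    // Sent (possibly repeatedly) by the coordinator until epoch 0 commits.
    Initialize { rack_uuid: Uuid, share: Vec<u8> },
    // Proposes a new trust quorum configuration for `epoch > 0`.
    KeySharePrepare { rack_uuid: Uuid, epoch: i32, share: Vec<u8> },
    // Locks in a previously prepared configuration at `epoch`.
    KeyShareCommit { rack_uuid: Uuid, epoch: i32 },
    // Retrieves this node's key share to unlock the trust quorum.
    GetShare { epoch: i32 },
}
```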
After initialization, reconfiguration of the trust quorum can take place. Reconfiguration messages must be issued with the same rack UUID. A reconfiguration consists of a `KeySharePrepare` for an `epoch > 0` followed by a `KeyShareCommit` for the same epoch. After initialization, a `KeySharePrepare` can never be mutated. However, to allow for certain failures, such as the failure of a node being prepared to ack (since we require unanimous acknowledgement in 2PC), we allow new `KeySharePrepare`s at higher epochs to proceed. We keep track of these epochs in CockroachDB to maintain linearizability. Once a node receives a prepare for a higher epoch, it will no longer accept prepare or commit messages for any lower epoch. Once a successful `KeyShareCommit` is handled, the sled is operating at the new epoch.
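The epoch rules in the previous paragraph boil down to a small monotonicity check; a minimal sketch, assuming per-node state tracking the committed and highest-prepared epochs (names are hypothetical):

```rust
// Hypothetical per-node epoch state implementing the ordering rule above:
// a prepare must be strictly newer than anything already seen. Whether a
// commit for an older prepared epoch should still be accepted is debated
// in the review thread above, so commit handling is left out here.
struct EpochState {
    committed: i32,    // epoch we are operating at (0 right after rack init)
    max_prepared: i32, // highest epoch for which we hold a prepare
}

impl EpochState {
    fn accept_prepare(&self, epoch: i32) -> bool {
        epoch > self.max_prepared && epoch > self.committed
    }
}
```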
`GetShare` requests are used to unlock the trust quorum. They will rely on sprockets at the network level for authentication, confidentiality, integrity, and attestation. Some DeviceId membership validation will also be added to `Node` in future commits, as it currently exists in sled-agent.

This is an incremental PR and there is still a lot to do here. Follow-up commits will carry the wrapped secrets in `KeySharePrepare`; this is the most straightforward way to transfer them, and we can get more sophisticated in the future. Again, this needs to be written up in RFD 238, but I'm likely to just implement it within the next week and see how it works first.
After that, the `bootstore` will basically be implemented at a non-network level. We'll then likely want to port over the sprockets code from the bootstrap agent to a higher level here, and use the `Node` and `Coordinator` from that level, which will end up interacting with Nexus and CockroachDB via the sled-agent. Multiple tasks will be utilizing the `Node` and its underlying `Db` at that point, so we'll need some concurrency control. I've given this a bit of thought, and my current thinking is to have the `Node` running in its own thread and serialize all operations via a channel. Outside of reconfiguration, all operations are reads, and we can simply cache the current share in memory for rack-unlock purposes (e.g. `GetShare`). During reconfiguration, we'll mutate the DB as necessary and refresh our cache. This strategy avoids the need for mutexes and has the nice property that the `Node` and `Db` code we run single-threaded in property-based tests runs the same way in production, with just a single DB connection. This strategy isn't set in stone, but in my mind it's the easiest to reason about and also the easiest way to hook async code to sync code.

I tagged @jclulow here because he's a SQLite maven, and wanted a quick look to see if I'm doing things in a sane manner. Any tips would be appreciated. I expect @jgallagher to be my primary logic reviewer for the rest of this :)
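As a concrete illustration of that channel-serialized design, here's a minimal sketch, assuming a `NodeOp` request enum and an in-memory share cache (all names hypothetical, not this PR's API):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch of the "Node in its own thread" model described
// above: every operation arrives over a channel and is handled one at a
// time, so neither Node nor Db needs a mutex.
enum NodeOp {
    // Rack-unlock read path; answered from the in-memory cache.
    GetShare { reply: mpsc::Sender<Option<Vec<u8>>> },
    // Prepare/commit ops would follow the same request/reply pattern,
    // mutating the DB and refreshing the cache before replying.
}

struct Node {
    cached_share: Option<Vec<u8>>, // refreshed whenever the DB is mutated
}

fn spawn_node(node: Node) -> mpsc::Sender<NodeOp> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Single-threaded event loop: operations are serialized by the
        // channel, matching how the property-based tests drive the code.
        for op in rx {
            match op {
                NodeOp::GetShare { reply } => {
                    let _ = reply.send(node.cached_share.clone());
                }
            }
        }
    });
    tx
}
```

An async caller would hold the returned `Sender`, send a `NodeOp` carrying its own reply channel, and await the response, while tests can drive the same loop synchronously.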