TQ: Support sled expunge via trust quorum pathway #9765
andrewjstone wants to merge 10 commits into main
Conversation
I tested this out by first trying to abort and watching it fail because
there is no trust quorum configuration. Then I issued an LRTQ upgrade,
which failed because I hadn't restarted the sled-agents to pick up the
LRTQ shares. Then I aborted that configuration, which was stuck in prepare.
Lastly, I successfully issued a new LRTQ upgrade after restarting the
sled-agents and watched it commit.
Here are the external API calls:
```
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
"error_code": "Not Found",
"message": "No trust quorum configuration exists for this rack",
"request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜ oxide.rs git:(main) ✗
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
"members": [
{
"part_number": "913-0000019",
"serial_number": "20000000"
},
{
"part_number": "913-0000019",
"serial_number": "20000001"
},
{
"part_number": "913-0000019",
"serial_number": "20000003"
}
],
"rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
"state": "aborted",
"time_aborted": "2026-01-29T01:54:02.590683Z",
"time_committed": null,
"time_created": "2026-01-29T01:37:07.476451Z",
"unacknowledged_members": [
{
"part_number": "913-0000019",
"serial_number": "20000000"
},
{
"part_number": "913-0000019",
"serial_number": "20000001"
},
{
"part_number": "913-0000019",
"serial_number": "20000003"
}
],
"version": 2
}
```
Here are the omdb calls:
```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade
Caused by:
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
2,
),
last_committed_epoch: None,
state: PreparingLrtqUpgrade,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T01:37:07.476451Z,
time_committing: None,
time_committed: None,
time_aborted: None,
abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
2,
),
last_committed_epoch: None,
state: Aborted,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T01:37:07.476451Z,
time_committing: None,
time_committed: None,
time_aborted: Some(
2026-01-29T01:54:02.590683Z,
),
abort_reason: Some(
"Aborted via API request",
),
}
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
3,
),
last_committed_epoch: None,
state: PreparingLrtqUpgrade,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T02:20:03.848507Z,
time_committing: None,
time_committed: None,
time_aborted: None,
abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
3,
),
last_committed_epoch: None,
state: Committed,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: Some(
EncryptedRackSecrets {
salt: Salt(
[
143,
198,
3,
63,
136,
48,
212,
180,
101,
106,
50,
2,
251,
84,
234,
25,
46,
39,
139,
46,
29,
99,
252,
166,
76,
146,
78,
238,
28,
146,
191,
126,
],
),
data: [
167,
223,
29,
18,
50,
230,
103,
71,
159,
77,
118,
39,
173,
97,
16,
92,
27,
237,
125,
173,
53,
51,
96,
242,
203,
70,
36,
188,
200,
59,
251,
53,
126,
48,
182,
141,
216,
162,
240,
5,
4,
255,
145,
106,
97,
62,
91,
161,
51,
110,
220,
16,
132,
29,
147,
60,
],
},
),
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
),
time_prepared: Some(
2026-01-29T02:20:46.792674Z,
),
time_committed: Some(
2026-01-29T02:21:49.503179Z,
),
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
),
time_prepared: Some(
2026-01-29T02:20:47.236089Z,
),
time_committed: Some(
2026-01-29T02:21:49.503179Z,
),
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
),
time_prepared: Some(
2026-01-29T02:20:46.809779Z,
),
time_committed: Some(
2026-01-29T02:21:52.248351Z,
),
},
},
time_created: 2026-01-29T02:20:03.848507Z,
time_committing: Some(
2026-01-29T02:20:47.597276Z,
),
time_committed: Some(
2026-01-29T02:21:52.263198Z,
),
time_aborted: None,
abort_reason: None,
}
```
After chatting with @davepacheco, I changed the authz checks in the datastore to do lookups with Rack scope. This fixed the test bug, but it is only a shortcut: trust quorum should have its own authz object, and I'm going to open an issue for that. Additionally, for methods that already took an authorized connection, I removed the unnecessary authz checks and the opctx parameter.
This commit adds a 3-phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum configuration via omdb. The second phase is to reboot the sled after polling for the commit of the configuration with the trust quorum removal. The third phase is to issue the existing omdb expunge command, which changes the sled policy as before.

The first and second phases remove the need to physically remove the sled before expungement. They act as a software mechanism that gates the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone, and it remains the same except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged.

The removed sled may be added back to this rack or another after being clean slated. I tested this by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled in a4x2, and it worked.

This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it until the key-rotation work from #9737 is merged.

This also builds on #9741 and should merge after that PR.
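As a rough illustration of that phase-3 sanity check, here is a minimal sketch. The type alias and function name are invented for the example; the real code uses omicron's `BaseboardId` and the existing expunge pathway.

```
use std::collections::BTreeSet;

// Hypothetical stand-in for a baseboard identity: (part_number, serial_number).
type Baseboard = (String, String);

// Refuse to expunge a sled that is still a member of the last committed
// trust quorum configuration; the operator must remove it via the trust
// quorum pathway (and reboot it) first.
fn check_sled_removed_from_tq(
    committed_members: &BTreeSet<Baseboard>,
    sled: &Baseboard,
) -> Result<(), String> {
    if committed_members.contains(sled) {
        return Err(format!(
            "sled {sled:?} is still in the last committed trust quorum \
             configuration; remove it from trust quorum before expunging"
        ));
    }
    Ok(())
}
```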
dev-tools/omdb/src/bin/omdb/nexus.rs
Outdated
```
the Reconfigurator will not yet know the sled is expunged and may \
still try to use it.

Therefore, you must treat this action in conjunction with a reboot as \
```
This formatting is definitely off. We should consider what we want this to look like. My guess is just removing the `\` from the formatting altogether so we don't end up with very long lines and weird indents.
Maybe either:
- three separate `println!`s, so we're not mixing "long strings that need `\`-continuations" with "I actually want newlines here"
- `\` on every line, with explicit `\n` on lines where you want newlines

?
I did the separate println! version in 2c55d34
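For reference, the "three separate `println!`s" shape might look roughly like the sketch below. The wording is abridged and invented for the example; see 2c55d34 for the actual change.

```
fn main() {
    // One println! per paragraph: no trailing `\` continuations, and blank
    // lines between paragraphs become an explicit, empty println!.
    println!("About to start the trust quorum reconfiguration to remove the sled.");
    println!();
    println!("You can poll the trust quorum reconfiguration with:");
    println!("  omdb nexus trust-quorum get-config <RACK_ID> <EPOCH | latest>");
}
```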
dev-tools/omdb/src/bin/omdb/nexus.rs
Outdated
```
println!(
    "About to start the trust quorum reconfiguration to remove the sled.

If this operation with a timeout, please check the latest trust quorum \
```
If this operation with a timeout,
Missing a word here?
dev-tools/omdb/src/bin/omdb/nexus.rs
Outdated
```
expungement.

You can poll the trust quorum reconfiguration with \
`omdb nexus trust-quorum get-config <RACK_ID> <EPOCH | latest>`\n"
```
This last bit looks duplicated by the next println!
Good call. I removed the following println!
```
.context("trust quorum remove sled")?
.into_inner();

println!("Started trust quorum reconfiguration at epoch {epoch}\n");
```
Could / should we go into a polling loop here so the operator doesn't have to do it manually (in the happy path)?
I thought we discussed this and explicitly decided to make this operation not block. I'm not sure how we can tell the happy path from the sad path here besides a timeout, and I'm not sure what timeout to use.
While I would really like to make all these operations one shot and simple to use, they inherently are not that way.
Yeahhh I remember discussing that but then reading this I didn't remember why. No objection to landing this as-is. We could potentially add a "...wait for trust-quorum reconfig to complete ..." subcommand if manual polling is annoying, maybe?
I'd prefer to keep it as is for now, and see if we can go back and find a better mechanism for all this. I expect that we'll simplify as much as possible once we add an external command for expunge, but I'm open to changing this before that happens also.
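If a wait subcommand is ever added, its core could be a bounded polling loop along these lines. This is purely a hypothetical sketch: the state enum, names, and timeout handling are invented, and this PR deliberately does not include such a loop.

```
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum TqState {
    Preparing,
    Committing,
    Committed,
    Aborted,
}

// Poll until the configuration commits, aborts, or the timeout expires.
fn wait_for_commit(
    mut fetch_state: impl FnMut() -> TqState,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        match fetch_state() {
            TqState::Committed => return Ok(()),
            TqState::Aborted => {
                return Err("trust quorum configuration was aborted".to_string());
            }
            TqState::Preparing | TqState::Committing => {}
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for trust quorum commit".to_string());
        }
        std::thread::sleep(poll_interval);
    }
}
```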
```
let rack_id = RackUuid::from_untyped_uuid(sled.rack_id);

// If the sled still exists in the latest committed trust quorum
// configuration, it cannot be expunged.
```
What does this do on racks that are still running in LRTQ?
An error will be returned:
```
return Err(Error::invalid_request(format!(
    "Missing trust quorum configurations for rack {rack_id}. \
    Upgrade to trust quorum required."
)));
```
Does that mean once this ships, we can't expunge a sled on an LRTQ rack before upgrading it to real TQ?
Correct. The call to lrtq-upgrade should be the first thing we do after install completes.
In the very unlikely event a sled fails after the install but before we get a chance to lrtq-upgrade, will lrtq-upgrade be blocked by a missing / unresponsive sled?
That's a great question. I will double check, but I'm pretty sure the code and property based tests take this into account. LRTQ upgrade should operate in the same manner as a regular reconfiguration where it allows a certain number of sleds to be absent/fail during both prepare and commit phases.
I just double checked. Inserting LRTQ configs is the same as non-lrtq configs except for the IsLrtqUpgrade field. Eventually, this boils down to a call to TrustQuorumConfig::new after validation. Inside this method we can see that the choice of commit_crash_tolerance is only parameterized on the number of sleds in the rack.
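In other words, the tolerance is a pure function of membership size, so a missing or unresponsive sled does not change which tolerance gets chosen. A sketch of that shape is below; the cutoffs are made up for illustration (the transcripts above show a 3-sled rack ending up with `commit_crash_tolerance: 0`), so see `TrustQuorumConfig::new` for the real values.

```
// Illustrative only: tolerance derived solely from the number of members.
// The thresholds here are invented; the real ones live in TrustQuorumConfig::new.
fn commit_crash_tolerance(num_members: usize) -> usize {
    match num_members {
        0..=3 => 0,
        4..=7 => 1,
        _ => 2,
    }
}
```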
nexus/src/app/trust_quorum.rs
Outdated
```
else {
return Err(Error::internal_error(&format!(
"Cannot retrieve newly inserted trust quorum \
configuration for rack {rack_id}, epoch {new_epoch}."
```
rustfmt sigh
```suggestion
configuration for rack {rack_id}, epoch {new_epoch}."
```
```
// Now send the reconfiguration request to the coordinator. We do
// this directly in the API handler because this is a non-idempotent
// operation and we only want to issue it once.
```
How do we proceed if we've inserted the latest config in the db, but fail before successfully sending the reconfigure operation? (Or if the sled receives the reconfigure then immediately crashes or whatever)
This is exactly the reason the abort operation exists. The problem there is that it's difficult (if not impossible) to always abort correctly in an automated fashion. So the customer has to be like: "I've been waiting 20 minutes, wtf is going on" and then try to abort. This is easier to tell with full trust quorum status in omdb since we differentiate prepare from commit phases. And you can also get some information by looking at inventory.
Once you abort, you can go ahead and try to expunge again. A new coordinator will be chosen randomly from members of the latest committed trust quorum.
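That coordinator selection is conceptually just a random pick from the committed membership, e.g. the sketch below (illustrative, with invented names; the real code works with `BaseboardId`s and its own RNG plumbing):

```
use rand::seq::IteratorRandom;
use std::collections::BTreeSet;

// Pick a new coordinator at random from the members of the latest
// committed trust quorum configuration.
fn pick_coordinator(committed_members: &BTreeSet<String>) -> Option<String> {
    committed_members.iter().choose(&mut rand::thread_rng()).cloned()
}
```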
Damn, looks like the changes to expunge are breaking existing tests. I'll either need to update those tests, update the expunge function, or move the check inside the expunge function into omdb.