TQ: Add External API for aborting a membership change #9741

Merged
andrewjstone merged 10 commits into main from tq-abort on Feb 3, 2026

Conversation

@andrewjstone
Contributor

I tested this out by first trying to abort and watching it fail because
there was no trust quorum configuration. Then I issued an LRTQ upgrade,
which failed because I didn't restart the sled-agents to pick up the
LRTQ shares, and I aborted that configuration while it was stuck in
prepare. Lastly, I successfully issued a new LRTQ upgrade after
restarting the sled-agents and watched it commit.

Here are the external API calls:

```
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
  "error_code": "Not Found",
  "message": "No trust quorum configuration exists for this rack",
  "request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜  oxide.rs git:(main) ✗
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
  "members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
  "state": "aborted",
  "time_aborted": "2026-01-29T01:54:02.590683Z",
  "time_committed": null,
  "time_created": "2026-01-29T01:37:07.476451Z",
  "unacknowledged_members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "version": 2
}
```

Here are the omdb calls:

```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade

Caused by:
    Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: Aborted,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: Some(
        2026-01-29T01:54:02.590683Z,
    ),
    abort_reason: Some(
        "Aborted via API request",
    ),
}

root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: Committed,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: Some(
        EncryptedRackSecrets {
            salt: Salt(
                [
                    143,
                    198,
                    3,
                    63,
                    136,
                    48,
                    212,
                    180,
                    101,
                    106,
                    50,
                    2,
                    251,
                    84,
                    234,
                    25,
                    46,
                    39,
                    139,
                    46,
                    29,
                    99,
                    252,
                    166,
                    76,
                    146,
                    78,
                    238,
                    28,
                    146,
                    191,
                    126,
                ],
            ),
            data: [
                167,
                223,
                29,
                18,
                50,
                230,
                103,
                71,
                159,
                77,
                118,
                39,
                173,
                97,
                16,
                92,
                27,
                237,
                125,
                173,
                53,
                51,
                96,
                242,
                203,
                70,
                36,
                188,
                200,
                59,
                251,
                53,
                126,
                48,
                182,
                141,
                216,
                162,
                240,
                5,
                4,
                255,
                145,
                106,
                97,
                62,
                91,
                161,
                51,
                110,
                220,
                16,
                132,
                29,
                147,
                60,
            ],
        },
    ),
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.792674Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
            ),
            time_prepared: Some(
                2026-01-29T02:20:47.236089Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.809779Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:52.248351Z,
            ),
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: Some(
        2026-01-29T02:20:47.597276Z,
    ),
    time_committed: Some(
        2026-01-29T02:21:52.263198Z,
    ),
    time_aborted: None,
    abort_reason: None,
}
```
Contributor
@jgallagher jgallagher left a comment

Handler LGTM; will defer to @ahl for the phrasing of the external API endpoint.

tags = ["experimental"],
versions = VERSION_TRUST_QUORUM_ABORT_CONFIG..
}]
async fn rack_membership_abort(
Contributor

Should this take a specific RackMembershipVersionParam like ..._status does instead of implicitly aborting the latest?

Contributor Author

You can only abort the latest membership, as only one trust quorum reconfiguration is allowed at a time.

I could see how that could be worrisome in the case of dueling administrators. However, it can only occur during the trust quorum prepare phase, which should be very short (in a future PR I need to activate the background task immediately rather than waiting on a timeout), so it's hard to do the wrong thing on human timescales. Additionally, even if it were done by accident: no harm, no foul. An admin can just reissue the last command to add the sleds again and it will work.
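
To illustrate the constraint being described, here is a minimal sketch of the kind of guard this implies; the type and function names are hypothetical stand-ins, not the code in this PR:

```rust
// Hypothetical sketch: abort only applies to the latest configuration, and
// only while that configuration is still in a preparing state.
enum TrustQuorumState {
    Preparing,
    PreparingLrtqUpgrade,
    Committing,
    Committed,
    Aborted,
}

struct TrustQuorumConfig {
    epoch: u64,
    state: TrustQuorumState,
}

fn can_abort(latest: Option<&TrustQuorumConfig>) -> Result<(), &'static str> {
    match latest {
        // Matches the 404 seen in the transcript above.
        None => Err("No trust quorum configuration exists for this rack"),
        Some(cfg) => match cfg.state {
            // Only a configuration stuck in prepare can be aborted.
            TrustQuorumState::Preparing
            | TrustQuorumState::PreparingLrtqUpgrade => Ok(()),
            // Anything later (committing, committed, already aborted) is
            // rejected in this sketch.
            _ => Err("latest configuration is not in a preparing state"),
        },
    }
}
```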

@andrewjstone andrewjstone mentioned this pull request Jan 29, 2026
After chatting with @davepacheco, I changed the authz checks in the
datastore to do lookups with Rack scope. This fixed the test bug, but is
only a shortcut. Trust quorum should have its own authz object and I'm
going to open an issue for that.

Additionally, for methods that already took an authorized connection, I
removed the unnecessary authz checks and opctx parameter.
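
As a rough illustration of that shape (Rack-scoped authorization at the public entry point, no opctx or redundant checks on the inner method), here is a self-contained toy; every type and method name below is an illustrative stand-in rather than omicron's real authz/lookup machinery:

```rust
// Toy model of the pattern described above; not omicron's real types.
struct OpContext;

struct AuthzRack {
    rack_id: String,
}

impl OpContext {
    // Stand-in for looking up the rack and checking Modify permission on it
    // ("lookups with Rack scope"). The toy always succeeds.
    fn authorize_rack_modify(&self, rack_id: &str) -> Result<AuthzRack, String> {
        Ok(AuthzRack { rack_id: rack_id.to_string() })
    }
}

struct DataStore;

impl DataStore {
    // Public entry point: performs the Rack-scoped authz check, then calls
    // the lower-level method with the already-authorized handle.
    fn tq_abort_latest(&self, opctx: &OpContext, rack_id: &str) -> Result<(), String> {
        let authz_rack = opctx.authorize_rack_modify(rack_id)?;
        self.tq_abort_latest_on_conn(&authz_rack)
    }

    // Lower-level method: takes the authorized handle, so it needs no opctx
    // parameter and performs no redundant authz check of its own.
    fn tq_abort_latest_on_conn(&self, authz_rack: &AuthzRack) -> Result<(), String> {
        println!("aborting latest trust quorum config for rack {}", authz_rack.rack_id);
        Ok(())
    }
}

fn main() {
    DataStore
        .tq_abort_latest(&OpContext, "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6")
        .unwrap();
}
```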
andrewjstone added a commit that referenced this pull request Jan 31, 2026
This commit adds a three-phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum
configuration via omdb. The second phase is to reboot the sled after
polling for the trust quorum removal to commit. The third phase is to
issue the existing omdb expunge command, which changes the sled policy
as before.

The first and second phases remove the need to physically remove the
sled before expungement. They act as a software mechanism that gates the
sled-agent from restarting on the sled and doing work when it should be
treated as "absent". We've discussed this numerous times in the update
huddle and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone
and remains the same except for an extra sanity check that the last
committed trust quorum configuration does not contain the sled that is
to be expunged.

The removed sled may be added back to this rack or another after being
clean slated. I tested this by deleting the files in the internal
"cluster" and "config" directories and rebooting the removed sled in
a4x2 and it worked.

This PR is marked draft because it changes the current
sled-expunge pathway to depend on real trust quorum. We
cannot safely merge it until the key-rotation work from
#9737 has merged.

This also builds on #9741 and should merge after that PR.
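
A minimal sketch of the sanity check mentioned in that third phase; the struct shapes loosely mirror the omdb get-config output earlier in the thread, but the function itself is illustrative rather than the code from that commit:

```rust
use std::collections::BTreeSet;

// Shapes loosely mirroring the omdb `get-config` output above.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct BaseboardId {
    part_number: String,
    serial_number: String,
}

struct TrustQuorumConfig {
    members: BTreeSet<BaseboardId>,
}

// Refuse to expunge a sled that is still part of the last committed trust
// quorum configuration: it must be removed (and that removal committed) first.
fn check_sled_not_in_committed_config(
    last_committed: &TrustQuorumConfig,
    sled: &BaseboardId,
) -> Result<(), String> {
    if last_committed.members.contains(sled) {
        Err(format!(
            "sled {}:{} is still in the last committed trust quorum configuration",
            sled.part_number, sled.serial_number
        ))
    } else {
        Ok(())
    }
}
```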
Contributor
@ahl ahl left a comment

This looks great. It has me thinking about how we explain all this to customers... keeping in mind that only a small handful of operators at each customer are ever going to be exposed to this workflow and it's exactly the kind of uncommon workflow for which a console/UI wizard-style interface makes sense (i.e. the API is 99.99% going to be used by the console).

              add a sled <-+
                  |        |
                  v        |
    abort <---- wait       |
      |           |        |
      v           v        |
    wait ---> complete     |
                  |        |
                  +--------+

I don't know that that's particularly illustrative or accurate, but an operator would add some pile of sleds; they might abort that operation if it... gets stuck? or they wait for it to complete. If they try to add more sleds before it completes... I think that API call fails. At any time one can get status which tells us the current "rack membership"--essentially, which sleds are part of the resource pool.

probe_delete DELETE /experimental/v1/probes/{probe}
probe_list GET /experimental/v1/probes
probe_view GET /experimental/v1/probes/{probe}
rack_membership_abort POST /v1/system/hardware/racks/{rack_id}/membership/abort
Contributor

This is what we discussed, and I think this makes sense adjacent to `add`.

req: TypedBody<params::RackMembershipAddSledsRequest>,
) -> Result<HttpResponseOk<RackMembershipStatus>, HttpError>;

/// Abort the latest rack membership change
Contributor

  • worth documenting whether this operation is synchronous or asynchronous?
  • what are the semantics if the latest operation was already completed (error or success)?

Contributor Author

I added a comment. It's bound to be frustrating to someone due to its lack of detail, but that detail really can't be provided without mentioning trust quorum.

Contributor

I like it

Contributor Author

Of course you do. You wrote it!

@andrewjstone
Contributor Author

> This looks great. It has me thinking about how we explain all this to customers... keeping in mind that only a small handful of operators at each customer are ever going to be exposed to this workflow and it's exactly the kind of uncommon workflow for which a console/UI wizard-style interface makes sense (i.e. the API is 99.99% going to be used by the console).
>
>               add a sled <-+
>                   |        |
>                   v        |
>     abort <---- wait       |
>       |           |        |
>       v           v        |
>     wait ---> complete     |
>                   |        |
>                   +--------+
>
> I don't know that that's particularly illustrative or accurate, but an operator would add some pile of sleds; they might abort that operation if it... gets stuck? or they wait for it to complete. If they try to add more sleds before it completes... I think that API call fails. At any time one can get status which tells us the current "rack membership"--essentially, which sleds are part of the resource pool.

That diagram looks correct to me. I'm not sure how to better explain it to customers either.
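
For readers following along, here is a minimal client-side sketch of the abort call itself, using reqwest against the endpoint exercised in the transcripts above; the base URL and token handling are assumptions, and only the response fields visible in the JSON output are modeled:

```rust
use serde::Deserialize;

// Only the fields visible in the JSON response above; everything else is omitted.
#[derive(Deserialize, Debug)]
struct MembershipStatus {
    state: String,
    version: u64,
}

async fn abort_membership_change(
    base_url: &str, // e.g. the recovery silo's API endpoint (assumption)
    token: &str,    // bearer token from an authenticated session (assumption)
    rack_id: &str,
) -> Result<MembershipStatus, reqwest::Error> {
    let url = format!(
        "{base_url}/v1/system/hardware/racks/{rack_id}/membership/abort"
    );
    let resp = reqwest::Client::new()
        .post(url)
        .bearer_auth(token)
        .send()
        .await?
        // A 404 here corresponds to "No trust quorum configuration exists".
        .error_for_status()?;
    resp.json::<MembershipStatus>().await
}
```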

@@ -19534,7 +19534,7 @@
]
},
"DiskSource": {
"description": "Different sources for a Distributed Disk",
Contributor Author

I have no idea why `cargo xtask openapi generate` would make this change on this PR after merging in main. Same for the others not related to trust quorum.

@andrewjstone andrewjstone enabled auto-merge (squash) February 2, 2026 21:41

@andrewjstone andrewjstone merged commit 5292c0d into main Feb 3, 2026
31 of 53 checks passed
@andrewjstone andrewjstone deleted the tq-abort branch February 3, 2026 00:49