Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
181 changes: 83 additions & 98 deletions nexus/db-queries/src/db/datastore/trust_quorum.rs

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions nexus/external-api/output/nexus_tags.txt
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ probe_create POST /experimental/v1/probes
probe_delete DELETE /experimental/v1/probes/{probe}
probe_list GET /experimental/v1/probes
probe_view GET /experimental/v1/probes/{probe}
rack_membership_abort POST /v1/system/hardware/racks/{rack_id}/membership/abort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what we discussed and I think this makes sense adjacent to add

rack_membership_add_sleds POST /v1/system/hardware/racks/{rack_id}/membership/add
rack_membership_status GET /v1/system/hardware/racks/{rack_id}/membership
support_bundle_create POST /experimental/v1/system/support-bundles
Expand Down
19 changes: 19 additions & 0 deletions nexus/external-api/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@ api_versions!([
// | date-based version should be at the top of the list.
// v
// (next_yyyymmddnn, IDENT),
(2026020200, TRUST_QUORUM_ABORT_CONFIG),
(2026013100, READ_ONLY_DISKS_NULLABLE),
(2026013001, READ_ONLY_DISKS),
(2026013000, INSTANCES_EXTERNAL_SUBNETS),
Expand Down Expand Up @@ -4819,6 +4820,24 @@ pub trait NexusExternalApi {
req: TypedBody<params::RackMembershipAddSledsRequest>,
) -> Result<HttpResponseOk<RackMembershipStatus>, HttpError>;

/// Abort the latest rack membership change
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • worth documenting if this operation in synchronous or asynchronous?
  • what are the semantics if the latest operation was already completed (error or success)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment, which is bound to be frustrating to someone due to lack of detail, which really can't be provided without mentioning trust quorum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course you do. You wrote it!

///
/// This operation is synchronous. Upon returning from the API call, a
/// success response indicates that the prior membership change was aborted.
/// An error response indicates that there is no active membership change
/// in progress (previous changes have completed) or that the current
/// membership change could not be aborted.
#[endpoint {
method = POST,
path = "/v1/system/hardware/racks/{rack_id}/membership/abort",
tags = ["experimental"],
versions = VERSION_TRUST_QUORUM_ABORT_CONFIG..
}]
async fn rack_membership_abort(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this take a specific RackMembershipVersionParam like ..._status does instead of implicitly aborting the latest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can only abort the latest membership, as only one trust quorum reconfiguration is allowed at a time.

I could see how that could be worrisome in the case of dueling administrators. However, it can only occur during the trust quorum prepare phase, which should be very short (I need to activate the bg task immediately rather than waiting on timeout in a future PR), so it's hard to do the wrong thing on human timescales. Additionally, even if it was done by accident, no harm, no foul. An admin can just reissue the last command to add the sleds again and it will work.

rqctx: RequestContext<Self::Context>,
path_params: Path<params::RackPath>,
) -> Result<HttpResponseOk<RackMembershipStatus>, HttpError>;

/// Retrieve the rack cluster membership status
///
/// Returns the status for the most recent change, or a specific version if
Expand Down
35 changes: 35 additions & 0 deletions nexus/src/app/trust_quorum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,41 @@ impl super::Nexus {
Ok(new_config)
}

/// Abort the latest trust quorum configuration if it is still in the
/// preparing state.
pub(crate) async fn tq_abort_latest_config(
&self,
opctx: &OpContext,
rack_id: RackUuid,
) -> Result<TrustQuorumConfig, Error> {
let Some(latest_config) =
self.db_datastore.tq_get_latest_config(opctx, rack_id).await?
else {
return Err(Error::non_resourcetype_not_found(
"No trust quorum configuration exists for this rack",
));
};

self.db_datastore
.tq_abort_config(
opctx,
rack_id,
latest_config.epoch,
"Aborted via API request".to_string(),
)
.await?;

// Return the updated configuration
self.db_datastore
.tq_get_config(opctx, rack_id, latest_config.epoch)
.await?
.ok_or_else(|| {
Error::internal_error(
"Configuration was just aborted but cannot be retrieved",
)
})
}

/// Read the set of all commissioned sleds from an existing deployment and
/// issue an upgrade request to trust quorum.
///
Expand Down
13 changes: 13 additions & 0 deletions nexus/src/external_api/http_entrypoints.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6145,6 +6145,19 @@ impl NexusExternalApi for NexusExternalApiImpl {
.await
}

async fn rack_membership_abort(
rqctx: RequestContext<Self::Context>,
path_params: Path<params::RackPath>,
) -> Result<HttpResponseOk<RackMembershipStatus>, HttpError> {
audit_and_time(&rqctx, |opctx, nexus| async move {
let rack_id =
RackUuid::from_untyped_uuid(path_params.into_inner().rack_id);
let status = nexus.tq_abort_latest_config(&opctx, rack_id).await?;
Ok(HttpResponseOk(status.into()))
})
.await
}

async fn rack_membership_status(
rqctx: RequestContext<Self::Context>,
path_params: Path<params::RackMembershipConfigPathParams>,
Expand Down
68 changes: 43 additions & 25 deletions nexus/tests/integration_tests/endpoints.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ use omicron_common::api::external::UserId;
use omicron_common::api::external::VpcFirewallRuleUpdateParams;
use omicron_test_utils::certificates::CertificateChain;
use semver::Version;
use std::collections::BTreeSet;
use std::net::IpAddr;
use std::net::Ipv4Addr;
use std::num::NonZeroU32;
Expand All @@ -66,6 +67,27 @@ pub static HARDWARE_SLED_PROVISION_POLICY_URL: LazyLock<String> =
SLED_AGENT_UUID
)
});
pub static HARDWARE_RACK_MEMBERSHIP_URL: LazyLock<String> =
LazyLock::new(|| {
format!("/v1/system/hardware/racks/{}/membership", RACK_UUID)
});
pub static HARDWARE_RACK_MEMBERSHIP_ADD_URL: LazyLock<String> =
LazyLock::new(|| {
format!("/v1/system/hardware/racks/{}/membership/add", RACK_UUID)
});
pub static HARDWARE_RACK_MEMBERSHIP_ABORT_URL: LazyLock<String> =
LazyLock::new(|| {
format!("/v1/system/hardware/racks/{}/membership/abort", RACK_UUID)
});
pub static DEMO_RACK_ADD_SLEDS_REQUEST: LazyLock<
params::RackMembershipAddSledsRequest,
> = LazyLock::new(|| params::RackMembershipAddSledsRequest {
sled_ids: BTreeSet::from([sled_hardware_types::BaseboardId {
serial_number: "demo-serial".to_string(),
part_number: "demo-part".to_string(),
}]),
});

pub static DEMO_SLED_PROVISION_POLICY: LazyLock<
params::SledProvisionPolicyParams,
> = LazyLock::new(|| params::SledProvisionPolicyParams {
Expand Down Expand Up @@ -136,16 +158,6 @@ pub static DEMO_LLDP_NEIGHBORS_URL: LazyLock<String> = LazyLock::new(|| {
)
});

// Rack membership
pub static DEMO_RACK_MEMBERSHIP_STATUS_URL: LazyLock<String> =
LazyLock::new(|| {
format!("/v1/system/hardware/racks/{}/membership", RACK_UUID)
});
pub static DEMO_RACK_MEMBERSHIP_ADD_URL: LazyLock<String> =
LazyLock::new(|| {
format!("/v1/system/hardware/racks/{}/membership/add", RACK_UUID)
});

// Alert resend
pub static DEMO_ALERT_RESEND_URL: LazyLock<String> = LazyLock::new(|| {
format!(
Expand Down Expand Up @@ -2839,6 +2851,27 @@ pub static VERIFY_ENDPOINTS: LazyLock<Vec<VerifyEndpoint>> = LazyLock::new(
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::Get],
},
VerifyEndpoint {
url: &HARDWARE_RACK_MEMBERSHIP_URL,
visibility: Visibility::Protected,
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::GetNonexistent],
},
VerifyEndpoint {
url: &HARDWARE_RACK_MEMBERSHIP_ADD_URL,
visibility: Visibility::Protected,
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::Post(
serde_json::to_value(&*DEMO_RACK_ADD_SLEDS_REQUEST)
.unwrap()
)],
},
VerifyEndpoint {
url: &HARDWARE_RACK_MEMBERSHIP_ABORT_URL,
visibility: Visibility::Protected,
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::Post(serde_json::Value::Null)],
},
VerifyEndpoint {
url: &HARDWARE_UNINITIALIZED_SLEDS,
visibility: Visibility::Public,
Expand Down Expand Up @@ -3604,21 +3637,6 @@ pub static VERIFY_ENDPOINTS: LazyLock<Vec<VerifyEndpoint>> = LazyLock::new(
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::GetNonexistent],
},
// Rack Membership
VerifyEndpoint {
url: &DEMO_RACK_MEMBERSHIP_STATUS_URL,
visibility: Visibility::Public,
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::GetNonexistent],
},
VerifyEndpoint {
url: &DEMO_RACK_MEMBERSHIP_ADD_URL,
visibility: Visibility::Public,
unprivileged_access: UnprivilegedAccess::None,
allowed_methods: vec![AllowedMethod::Post(serde_json::json!({
"sled_ids": [],
}))],
},
// Alert Delivery Resend
VerifyEndpoint {
url: &DEMO_ALERT_RESEND_URL,
Expand Down
Loading
Loading