Conversation
sled-agent/config-reconciler/src/reconciler_task/external_disks.rs (outdated review thread, resolved)
This commit adds a three-phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum configuration via omdb. The second phase is to reboot the sled after polling until the trust quorum removal has committed. The third phase is to issue the existing omdb expunge command, which changes the sled policy as before.

The first and second phases remove the need to physically remove the sled before expungement. They act as a software mechanism that prevents the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving! The third phase is what informs reconfigurator that the sled is gone, and it remains the same except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged.

The removed sled may be added back to this rack or another after being clean slated. I tested this by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled in a4x2, and it worked.

This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it until the key-rotation work from #9737 is merged. This also builds on #9741 and should merge after that PR.
Force-pushed from eb41801 to c070803
When Trust Quorum commits a new epoch, all U.2 crypt datasets must have their encryption keys rotated to use the new epoch's derived key. This change implements the key rotation flow triggered by epoch commits.

## Trust Quorum Integration

- Add watch channel to `NodeTaskHandle` for epoch change notifications
- Initialize channel with current committed epoch on startup
- Notify subscribers via `send_if_modified()` when epoch changes

## Config Reconciler Integration

- Accept `committed_epoch_rx` watch channel from trust quorum
- Trigger reconciliation when epoch changes
- Track per-disk encryption epoch in `ExternalDisks`
- Add `rekey_for_epoch()` to coordinate key rotation:
  - Filter disks needing rekey (cached epoch < target OR unknown)
  - Derive keys for each disk via `StorageKeyRequester`
  - Send batch request to dataset task
  - Update cached epochs on success
  - Retry on failure via normal reconciliation retry logic

## Dataset Task Changes

- Add `RekeyRequest`/`RekeyResult` types for batch rekey operations
- Add `datasets_rekey()` with idempotency check (skip if already at target)
- Use `Zfs::change_key()` for atomic key + epoch property update

## ZFS Utilities

- Add `Zfs::change_key()` using `zfs_atomic_change_key` crate
- Add `Zfs::load_key()`, `unload_key()`, `dataset_exists()`
- Add `epoch` field to `DatasetProperties`
- Add structured error types for key operations

## Crash Recovery

- Add trial decryption recovery in `sled-storage` for datasets with missing epoch property (e.g., crash during initial creation)
- Unload key before each trial attempt to handle crash-after-load-key
- Set epoch property after successful recovery

## Safety Properties

- Atomic: Key and epoch property set together via `zfs_atomic_change_key`
- Idempotent: Skip rekey if dataset already at target epoch
- Crash-safe: Epoch read from ZFS on restart rebuilds cache correctly
- Conservative: Unknown epochs (None) trigger rekey attempt
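A minimal sketch of the epoch-change notification described above, assuming the watch channel carries the committed epoch as a plain integer (`u64` stands in for the actual `Epoch` type, and the free function stands in for the `NodeTaskHandle` plumbing):

```rust
use tokio::sync::watch;

/// Sketch only: called wherever trust quorum observes a newly committed epoch.
fn notify_epoch_committed(tx: &watch::Sender<u64>, committed: u64) {
    // send_if_modified() only wakes subscribers (the config reconciler) when
    // the committed epoch actually advances; re-sending the same epoch does
    // not trigger a spurious reconciliation.
    tx.send_if_modified(|current| {
        if *current < committed {
            *current = committed;
            true
        } else {
            false
        }
    });
}
```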
Force-pushed from c070803 to e45b10d
Force-pushed from 1a9b8cf to 291e41e
Create a new key-manager-types crate containing the disk encryption key types (Aes256GcmDiskEncryptionKey and VersionedAes256GcmDiskEncryptionKey) that were previously defined in key-manager. This breaks the dependency from illumos-utils to key-manager, allowing illumos-utils to depend only on the minimal types crate. The key-manager crate re-exports VersionedAes256GcmDiskEncryptionKey for backwards compatibility.
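A sketch of what the backwards-compatibility re-export might look like (the module path is an assumption, not the actual crate layout):

```rust
// key-manager/src/lib.rs (hypothetical location)
// Re-export the type that moved to the new minimal crate so existing
// `key_manager::VersionedAes256GcmDiskEncryptionKey` imports keep compiling.
pub use key_manager_types::VersionedAes256GcmDiskEncryptionKey;
```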
Force-pushed from 291e41e to ad6274e
sled-agent/config-reconciler/src/reconciler_task/external_disks.rs (outdated review thread, resolved)
```rust
/// Returns (exists=true, mounted) if the dataset exists with the expected
/// mountpoint, (exists=false, mounted=false) if it doesn't exist.
/// Returns an error if the dataset exists but has an unexpected mountpoint.
async fn dataset_mount_info(
```
I renamed this internal function: it was previously named `dataset_exists`, but I wanted to use that name elsewhere, and it wasn't a good description of what this function actually returns (`DatasetMountInfo`).
andrewjstone left a comment:
This looks fantastic @plaidfinch.
We'll need to test this and possibly coordinate merging it in case it relies on trust quorum being enabled / lrtq upgrade, etc...
```diff
 impl DatasetProperties {
-    const ZFS_GET_PROPS: &'static str =
-        "oxide:uuid,name,mounted,avail,used,quota,reservation,compression";
+    const ZFS_GET_PROPS: &'static str = "oxide:uuid,oxide:epoch,name,mounted,avail,used,quota,reservation,compression";
```
This needs to be formatted.
```rust
    log,
    "Rotating encryption key";
    "dataset" => &req.dataset_name,
    "new_epoch" => new_epoch,
```
Worth logging the current epoch?
could just add the current epoch to the new logging scope that's created above, perhaps...
```rust
    log,
    "Failed to rotate encryption key";
    "dataset" => &req.dataset_name,
    "error" => %e,
```
Worth logging the old and new epochs here?
```rust
/// Returns true if any rekey operations failed.
async fn rekey_for_epoch(&mut self, target_epoch: Epoch) -> bool {
    // Filter to disks that need rekeying:
    // - Known epoch < target: definitely needs rekey
```
I don't see how this would be possible, but I'm curious if we should log an error in case e > target_epoch, just to cover all conditions.
```rust
}

if request.disks.is_empty() {
    info!(
```
Should this be `warn!`?
```rust
    format!("{}/{}", info.disk.zpool_name(), CRYPT_DATASET);
match self
    .key_requester
    .get_key(target_epoch.0, info.disk.identity().clone())
```
This can block indefinitely, while waiting to load the secret. I don't think that's a problem but wanted to mention it. I think other things can block in the reconciler also.
```rust
// Get the key for this epoch
let key =
    match key_requester.get_key(epoch, disk_identity.clone()).await {
```
It may be worth it at some point to get the set of known committed epochs when loading the latest key. Loading the latest key decrypts and loads all prior epochs, so this should be cheap. Then if we aborted a bunch for some reason we wouldn't need to make those requests. This is just an optimization though and not necessary to do in this PR.
```rust
pub compression: String,
/// The encryption key epoch for this dataset.
///
/// Only present on encrypted datasets (e.g., crypt datasets on U.2s).
```
Maybe worth noting here that this is not present on datasets that inherit encryption from a parent (if I understand that correctly)?
```rust
#[derive(Debug, thiserror::Error)]
pub enum KeyRotationError {
    #[error("failed to rotate encryption key for dataset {dataset}")]
    ChangeKeyFailed {
```
This is fine as-is, but if this is the only variant this error will ever have, we could shorten it a bit by making it a struct?
```rust
#[derive(Debug, thiserror::Error)]
#[error("failed to rotate encryption key for dataset {dataset}")]
pub struct KeyRotationError {
    dataset: String,
    #[source]
    err: anyhow::Error,
}
```

```rust
#[derive(Debug, Default)]
pub struct RekeyRequest {
    /// Datasets to rekey, keyed by physical disk UUID.
    pub disks: BTreeMap<PhysicalDiskUuid, DatasetRekeyInfo>,
```
I think this is fine but just making sure I understand - we only expect to ever have one encrypted dataset per disk, and any datasets that need encryption will be children of it (and inherit its encryption), right?
```rust
// Ensure we process the initial epoch on startup. Using mark_unchanged()
// means that changed() will fire on the first value, allowing us to
// catch any missed rekeys from crashes.
self.committed_epoch_rx.mark_unchanged();
```
I don't think we need this now that we moved rekeying into do_reconciliation; we're about to go into the loop which unconditionally starts with a call to do_reconciliation, which means we will immediately try to rekey if necessary. As written this will induce a second do_reconciliation (which is harmless but not useful).
```rust
// Check if any disks need rekeying to the current committed epoch.
// We use borrow_and_update() to mark the epoch as seen, so we don't
// trigger another reconciliation for the same epoch change.
// Note: We must copy the epoch out of the Ref before any await points.
```
> // Note: We must copy the epoch out of the Ref before any await points.

This is true, but I think in general we want the time we hold the watch channel to be as short as possible. If we did something like:

```rust
let epoch_ref = watch_rx.borrow_and_update();
do_a_synchronous_thing(&epoch_ref);
drop(epoch_ref);
do_an_async_thing().await;
```

we still keep the lock on the channel held throughout do_a_synchronous_thing, which we'd prefer to avoid. It's wonderful when watch channels hold Copy types like this one and we can just immediately deref as you have here.
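As a standalone sketch of the recommended shape (here `u64` stands in for `Epoch`, and the function name is hypothetical):

```rust
use tokio::sync::watch;

/// Deref-copy the value out of the Ref immediately, so the channel's internal
/// read lock is released before any further work or await point.
async fn on_epoch_change(rx: &mut watch::Receiver<u64>) -> u64 {
    let target_epoch = *rx.borrow_and_update();
    // ...async work (e.g. rekey_for_epoch(target_epoch).await) goes here...
    target_epoch
}
```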
```rust
let dataset_name =
    format!("{}/{}", info.disk.zpool_name(), CRYPT_DATASET);
```
Turbo-nit - maybe move this down into the Ok(key) => { ... } branch? That'd minimize the scope of the variable and let us skip the allocation if we don't need it.
```diff
     let name = format!("{}/{}", zpool_name, dataset);
     let epoch = if let Ok(epoch_str) =
-        Zfs::get_oxide_value(dataset, "epoch").await
+        Zfs::get_oxide_value(&name, "epoch").await
```
Was this wrong before and always failing?
Ha yes. Claude found this one yesterday and Finch pointed it out to me. It was always wrong, but innocuous because LRTQ doesn't allow key rotations.
```diff
-    let mut command = tokio::process::Command::new(illumos_utils::zfs::ZFS);
-    let cmd = command.args(&["list", "-H", dataset]);
-    Ok(cmd.status().await?.success())
+    Ok(Zfs::dataset_exists(dataset).await?)
```
At this point should we remove this helper and inline Zfs::dataset_exists wherever it's called?
```rust
// loaded keys across process restarts, so if we crashed after a
// successful load-key but before setting the epoch property, the
// correct key would still be loaded and our load-key would fail.
let _ = Zfs::unload_key(dataset_name).await;
```
What happens if this fails?
| "epoch" => epoch, | ||
| ); | ||
|
|
||
| // No need to unload here -- we unload at the start of each iteration |
Do we need to unload on the last iteration?
We tested this on a4x2 with TQ enabled and keys were created correctly during RSS. We then added a sled and rotation completed correctly. We're going to relaunch with trust quorum disabled and see if we can upgrade out of lrtq successfully and then add a sled in our next experiment.
```rust
.await
.map_err(|e| ChangeKeyError {
    name: dataset.to_string(),
    err: anyhow::anyhow!("{e}"),
```
I think this should be

```diff
-    err: anyhow::anyhow!("{e}"),
+    err: anyhow::Error::from(e),
```

rather than formatting it: formatting eats the structured representation of the error's source chain, while `Error::from` preserves it.
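A standalone illustration of the difference (the `io::Error` here is just a stand-in for the real ZFS error):

```rust
use anyhow::anyhow;

fn main() {
    // Build a two-level chain: a context message wrapping the underlying error.
    let inner = std::io::Error::other("device busy");
    let outer = anyhow::Error::from(inner).context("zfs change-key failed");

    // Error::from + context keeps every link reachable via chain().
    assert_eq!(outer.chain().count(), 2);

    // Formatting into a new error collapses it into a single string with no
    // source chain behind it.
    let flattened = anyhow!("{outer:#}");
    assert_eq!(flattened.chain().count(), 1);
}
```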
```rust
.filter(|i| match i.cached_epoch {
    Some(e) => e < target_epoch,
    None => true,
})
```
Turbo nit: `Option<T: PartialOrd>` is itself `PartialOrd`, and `None` compares less than any `Some`, so this can just be

```diff
-    .filter(|i| match i.cached_epoch {
-        Some(e) => e < target_epoch,
-        None => true,
-    })
+    .filter(|i| i.cached_epoch < Some(target_epoch))
```
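For reference, a quick standalone check of that ordering claim:

```rust
fn main() {
    // Option<T> is ordered when T is, and None sorts before any Some, so a
    // disk with an unknown cached epoch always compares less than the target
    // and is selected for rekeying.
    assert!(None < Some(0u64));
    assert!(Some(1u64) < Some(2u64));
    assert!(!(Some(2u64) < Some(2u64)));
}
```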
Woohoo. Tested on hardware. LRTQ upgrade succeeded. This required a small patch, which we should pull into this branch: 494d49a. I think it's probably good to go in once the comments are resolved and this patch is pulled in. Will re-assess when fresh tomorrow.