[sled-agent-config-reconciler] Report orphaned datasets (PR 1/2) #8301
Conversation
@@ -942,8 +942,6 @@ pub enum DatasetKind {

    // Other datasets
    Debug,
    // Stores update artifacts (the "TUF Repo Depot")
    Update,
sweet, thanks for the cleanup
#[proptest]
fn parse_dataset_name(pool_id: [u8; 16], kind: DatasetKind) {
    let pool_id = ZpoolUuid::from_bytes(pool_id);
arguably this could be producing invalid UUIDv4s - e.g., with an arbitrary byte input sequence, the version/variant bits could be wrong - but I dunno if we care.
I don't think we care (nor should we, presumably?).
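For reference, here is a minimal sketch of how the strategy could be constrained to well-formed v4 UUIDs if that ever mattered. It assumes the `uuid` crate's `Builder` and omicron's `GenericUuid` helper, and is not part of this PR:

```rust
use omicron_uuid_kinds::{GenericUuid, ZpoolUuid};
use uuid::Builder;

// Sketch only: coerce arbitrary proptest-generated bytes into a well-formed
// v4 UUID before wrapping it in the typed pool UUID.
fn valid_v4_pool_id(bytes: [u8; 16]) -> ZpoolUuid {
    // `from_random_bytes` sets the RFC 4122 variant and version-4 bits; the
    // remaining 122 bits come straight from the input.
    let uuid = Builder::from_random_bytes(bytes).into_uuid();
    ZpoolUuid::from_untyped_uuid(uuid)
}
```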
// We _don't_ need to check children of
// `{zpool}/{ZONE_DATASET}`: these are all transient, and
// should be destroyed/created on demand.
To be super explicit, what's deleting these datasets today? Is it the deletion of the zone?
Nothing is explicitly deleting them today, although they do get cleaned up on a sled reboot; "should" is doing a lot of work here. We should also be destroying + recreating them if we stop / restart a zone, which I don't think is happening either. I'll file an issue or two for this.
Filed #8316
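Nothing in this PR changes that behavior; for illustration, a rough sketch of what explicit cleanup on zone teardown could look like. The helper name and the dataset layout in the comment are assumptions, not sled-agent's actual code path:

```rust
use std::process::Command;

/// Hypothetical helper (not what sled-agent does today): explicitly destroy a
/// zone's transient dataset when the zone is torn down, instead of relying on
/// a sled reboot to clean it up. The full name would be a child of
/// `{zpool}/{ZONE_DATASET}`; the exact layout is an assumption here.
fn destroy_transient_zone_dataset(transient_dataset_name: &str) -> anyhow::Result<()> {
    let output = Command::new("/usr/sbin/zfs")
        .args(["destroy", "-r", transient_dataset_name])
        .output()?;
    if !output.status.success() {
        anyhow::bail!(
            "zfs destroy {transient_dataset_name} failed: {}",
            String::from_utf8_lossy(&output.stderr)
        );
    }
    Ok(())
}
```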
// Does this dataset have an ID set? If we created it, we expect it
// does, and we can easily check whether it still exists in our
// config. A dataset with a known `DatasetKind` without an ID set is
// pretty unexpected: we'll search through our config datasets to
// find a matching name; if we do, we'll assume it's that one.
let present_in_config = match properties.id {
    Some(id) => datasets.contains_key(&id),
    None => {
        if datasets.iter().any(|d| d.name == dataset) {
            warn!(
                self.log,
                "found on-disk dataset without an ID \
                 that matches a config dataset by name";
                "dataset" => dataset.full_name(),
            );
            true
        } else {
            warn!(
                self.log,
                "found on-disk dataset without an ID \
                 that doesn't match any config datasets; assuming \
                 it should be marked as an orphan";
                "dataset" => dataset.full_name(),
            );
            false
        }
    }
};
if present_in_config {
    continue;
}
Up to you, but, this whole block could be marked as "for backwards compatibility" -- we could someday remove it when we eventually are confident all datasets have UUIDs
I'm not sure we can guarantee that - isn't "create the dataset" a separate step from "set the `oxide:uuid` property on a dataset" (and therefore we could succeed in creating but fail in setting the ID)?

And maybe a step beyond - even if that's wrong and we can create+set ID atomically, I don't think we have a reasonable way to represent reading ZFS datasets where an ID is required (i.e., I'd expect `zfs.get_dataset_properties()` to always return an `Option<DatasetUuid>`, which we have to handle somehow).
Yeah, good point. I suppose you're right, the lack of atomicity means this is possible (pretty sure you mentioned this in the sync today too, re: UUIDs embedded in names)
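To make the window concrete, here's a minimal sketch of the non-atomic sequence being discussed, assuming dataset creation boils down to two separate `zfs` invocations (the real sled-agent code path may differ):

```rust
use std::process::Command;

// Illustrative only: the two ZFS operations are separate commands, so a crash
// between them leaves a dataset on disk with no `oxide:uuid` property, which
// is the case the name-matching fallback above guards against.
fn create_dataset_with_id(full_name: &str, id: &str) -> std::io::Result<()> {
    // Step 1: create the dataset.
    let created = Command::new("/usr/sbin/zfs").args(["create", full_name]).status()?;
    if !created.success() {
        return Err(std::io::Error::other("zfs create failed"));
    }
    // A crash here leaves the dataset without an ID.
    // Step 2: set the ID property in a second, non-atomic step.
    let prop = format!("oxide:uuid={id}");
    let tagged = Command::new("/usr/sbin/zfs")
        .args(["set", prop.as_str(), full_name])
        .status()?;
    if !tagged.success() {
        return Err(std::io::Error::other("zfs set oxide:uuid failed"));
    }
    Ok(())
}
```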
| DatasetKind::ClickhouseServer
| DatasetKind::ExternalDns
| DatasetKind::InternalDns => {
    // We should attempt to delete this; for now, just report it
I'm a little confused about the comments here - you're implying we "should delete" or "should refuse to delete", but this match statement just returns a "reason" with no label about whether deletion should proceed.
Is this a thing we're changing in a follow-up? Or a relic of a previous version?
Changing in a followup: the `reason` for the dataset being orphaned in this branch is "we don't delete datasets yet (see omicron#6177)". Addressing #6177 therefore turns into replacing this with "try to delete the dataset, and only report an orphan reason if deletion fails".

The other dataset kinds will not be deleted as a part of the followup work to address #6177.
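In other words, the follow-up roughly turns that arm into something like the sketch below (hypothetical names; the actual follow-up may differ):

```rust
// Hypothetical sketch of the follow-up (#6177) behavior: attempt the deletion
// and only produce an orphan "reason" if it fails. `destroy_dataset` stands in
// for whatever deletion helper the real follow-up uses.
async fn orphan_reason_after_attempted_delete(dataset_full_name: &str) -> Option<String> {
    match destroy_dataset(dataset_full_name).await {
        // Deleted successfully: nothing to report as an orphan.
        Ok(()) => None,
        // Deletion failed: keep the dataset in the orphan report, with the
        // error as its reason.
        Err(err) => Some(format!("failed to delete dataset: {err}")),
    }
}

async fn destroy_dataset(_full_name: &str) -> anyhow::Result<()> {
    // Placeholder for the real ZFS destroy call.
    Ok(())
}
```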
Will open PR 3 of 2 momentarily that addresses this.
Builds on #8301. Adds the orphaned datasets gathered by `sled-agent-config-reconciler` to inventory, and adds an `omdb db inventory collections show $COLLECTION orphaned-datasets` filter to show them.
During a config reconciliation pass, when we ought to delete datasets (but don't yet - #6177), we now run a `zfs get ...` and attempt to scan for datasets we ought to delete. These are accumulated into an in-memory set that will be reportable via inventory (coming in the subsequent PR).
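Roughly, the scan amounts to something like the sketch below; the exact `zfs get` arguments and the shape of the config lookup are assumptions based on the description above, not the PR's literal code:

```rust
use std::collections::BTreeSet;
use std::process::Command;

/// Sketch of the scan described above: list datasets under a pool via
/// `zfs get`, and accumulate any that the sled config no longer knows about
/// into an in-memory set for later reporting via inventory.
fn find_orphan_candidates(
    zpool: &str,
    config_dataset_names: &BTreeSet<String>,
) -> std::io::Result<BTreeSet<String>> {
    // `-H` drops headers, `-r` recurses, `-o name,value` limits the columns.
    let output = Command::new("/usr/sbin/zfs")
        .args(["get", "-H", "-r", "-o", "name,value", "oxide:uuid", zpool])
        .output()?;
    let mut orphans = BTreeSet::new();
    for line in String::from_utf8_lossy(&output.stdout).lines() {
        let Some((name, _uuid)) = line.split_once('\t') else { continue };
        // Anything on disk that the current config doesn't know about is a
        // candidate orphan; the real reconciler applies more checks than this.
        if !config_dataset_names.contains(name) {
            orphans.insert(name.to_string());
        }
    }
    Ok(orphans)
}
```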