
[sled-agent-config-reconciler] Report orphaned datasets (PR 1/2) #8301

Merged

Conversation

jgallagher
Contributor

During a config reconciliation pass, at the point where we ought to delete datasets (but don't yet - #6177), we now run a zfs get ... and scan for datasets that ought to be deleted. These are accumulated into an in-memory set that will be reportable via inventory (coming in the subsequent PR).
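
Very roughly, the new pass has this shape (a simplified sketch with hypothetical names and types, not the actual reconciler code):

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in; the real types live in sled-agent-config-reconciler.
type DatasetName = String;

#[derive(Default)]
struct OrphanedDatasetReport {
    // In-memory set of datasets we would have deleted (see omicron#6177);
    // the follow-up PR makes this reportable via inventory.
    orphans: BTreeSet<DatasetName>,
}

fn scan_for_orphans(
    on_disk: impl IntoIterator<Item = DatasetName>, // e.g. parsed `zfs get` output
    in_config: &BTreeSet<DatasetName>,
    report: &mut OrphanedDatasetReport,
) {
    for dataset in on_disk {
        // Any dataset present on disk but absent from the current sled config
        // is an orphan candidate; record it instead of deleting it for now.
        if !in_config.contains(&dataset) {
            report.orphans.insert(dataset);
        }
    }
}

fn main() {
    let in_config: BTreeSet<_> = ["oxp_pool/crypt/debug".to_string()].into();
    let mut report = OrphanedDatasetReport::default();
    scan_for_orphans(
        ["oxp_pool/crypt/debug".to_string(), "oxp_pool/crypt/stale".to_string()],
        &in_config,
        &mut report,
    );
    assert_eq!(report.orphans.len(), 1);
}
```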

@@ -942,8 +942,6 @@ pub enum DatasetKind {

    // Other datasets
    Debug,
    // Stores update artifacts (the "TUF Repo Depot")
    Update,
Collaborator

sweet, thanks for the cleanup


#[proptest]
fn parse_dataset_name(pool_id: [u8; 16], kind: DatasetKind) {
    let pool_id = ZpoolUuid::from_bytes(pool_id);
Collaborator

arguably this could be producing invalid UUIDv4s - with an arbitrary byte input sequence, the version/variant bits could be wrong - but I dunno if we care.

Contributor Author

I don't think we care (nor should we, presumably?).
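
(For reference, if we ever did care: the uuid crate can coerce arbitrary bytes into a structurally valid v4 UUID. A sketch, not something this PR does:)

```rust
use uuid::Builder;

fn main() {
    // Arbitrary proptest-style input; version/variant bits may be "wrong".
    let raw: [u8; 16] = [0xab; 16];
    // Builder::from_random_bytes forces version 4 and the RFC 4122 variant,
    // so the result is always a well-formed UUIDv4.
    let uuid = Builder::from_random_bytes(raw).into_uuid();
    assert_eq!(uuid.get_version_num(), 4);
    println!("{uuid}");
}
```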

Comment on lines +479 to +481
// We _don't_ need to check children of
// `{zpool}/{ZONE_DATASET}`: these are all transient, and
// should be destroyed/created on demand.
Collaborator

To be super explicit, what's deleting these datasets today? Is it the deletion of the zone?

Contributor Author

Nothing is explicitly deleting them today, although they do get cleaned up on a sled reboot; "should" is doing a lot of work here. We should also be destroying + recreating them if we stop / restart a zone, which I don't think is happening either. I'll file an issue or two for this.

Contributor Author

Filed #8316

Comment on lines +507 to +537
// Does this dataset have an ID set? If we created it, we expect it
// does, and we can easily check whether it still exists in our
// config. A dataset with a known `DatasetKind` without an ID set is
// pretty unexpected: we'll search through our config datasets to
// find a matching name; if we do, we'll assume it's that one.
let present_in_config = match properties.id {
    Some(id) => datasets.contains_key(&id),
    None => {
        if datasets.iter().any(|d| d.name == dataset) {
            warn!(
                self.log,
                "found on-disk dataset without an ID \
                 that matches a config dataset by name";
                "dataset" => dataset.full_name(),
            );
            true
        } else {
            warn!(
                self.log,
                "found on-disk dataset without an ID \
                 that doesn't match any config datasets; assuming \
                 it should be marked as an orphan";
                "dataset" => dataset.full_name(),
            );
            false
        }
    }
};
if present_in_config {
    continue;
}
Collaborator

Up to you, but this whole block could be marked as "for backwards compatibility" - we could remove it someday, once we're confident all datasets have UUIDs.

Contributor Author

I'm not sure we can guarantee that - isn't "create the dataset" a separate step from "set the oxide:uuid property on a dataset" (and therefore we could succeed in creating but fail in setting the ID)?

And maybe a step beyond - even if that's wrong and we can create+set ID atomically, I don't think we have a reasonable way to represent reading ZFS datasets where an ID is required (i.e., I'd expect zfs.get_dataset_properties() to always return an Option<DatasetUuid>, which we have to handle somehow).
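
(A sketch of the non-atomicity in question; purely illustrative, since the real code goes through illumos-utils rather than shelling out like this:)

```rust
use std::io;
use std::process::Command;

// Creating a dataset and recording its ID are two separate `zfs` invocations,
// so a crash or failure between them leaves a dataset with no oxide:uuid;
// that's why the ID read back from disk has to stay an Option.
fn create_dataset_with_id(name: &str, id: &str) -> io::Result<()> {
    // Step 1: create the dataset.
    let status = Command::new("zfs").args(["create", name]).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "zfs create failed"));
    }
    // Step 2: set the oxide:uuid user property. Failing here leaves the
    // dataset on disk without an ID.
    let prop = format!("oxide:uuid={id}");
    let status = Command::new("zfs").args(["set", prop.as_str(), name]).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "zfs set failed"));
    }
    Ok(())
}

fn main() {
    // Not actually invoked here; the sketch only exists to show the two steps.
    let _ = create_dataset_with_id;
}
```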

Collaborator

Yeah, good point. I suppose you're right: the lack of atomicity means this is possible (pretty sure you mentioned this in the sync today too, re: UUIDs embedded in names).

| DatasetKind::ClickhouseServer
| DatasetKind::ExternalDns
| DatasetKind::InternalDns => {
    // We should attempt to delete this; for now, just report it
Collaborator

I'm a little confused about the comments here - you're implying we "should delete" or "should refuse to delete", but this match statement just returns a "reason" with no label about whether deletion should proceed.

Is this a thing we're changing in a follow-up? Or a relic of a previous version?

Contributor Author

Changing in a follow-up: the reason for the dataset being orphaned in this branch is "we don't delete datasets yet (see omicron#6177)". Addressing #6177 therefore turns into replacing this with "try to delete the dataset, and only report an orphan reason if deletion fails".

The other dataset kinds will not be deleted as part of the follow-up work to address #6177.
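
(Roughly, the follow-up described above would reshape this along the following lines; a hypothetical sketch, not the actual follow-up code:)

```rust
// Hypothetical sketch of the follow-up: attempt deletion for the kinds we're
// allowed to delete, and only record an orphan reason if deletion fails.
#[allow(dead_code)]
enum OrphanReason {
    DeletionNotYetImplemented, // what this PR reports (omicron#6177)
    DeletionFailed(String),    // what the follow-up would report instead
}

fn try_remove_orphan(
    destroy: impl FnOnce() -> Result<(), String>,
) -> Option<OrphanReason> {
    match destroy() {
        // Destroyed successfully: nothing to report as an orphan.
        Ok(()) => None,
        // Destruction failed: keep reporting it, with the failure as the reason.
        Err(err) => Some(OrphanReason::DeletionFailed(err)),
    }
}

fn main() {
    let reason = try_remove_orphan(|| Err("zfs destroy failed".to_string()));
    assert!(matches!(reason, Some(OrphanReason::DeletionFailed(_))));
}
```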

Contributor Author

Will open PR 3 of 2 momentarily that addresses this.

jgallagher merged commit 89ce370 into main Jun 12, 2025
17 checks passed
jgallagher deleted the john/sled-agent-config-reconciler-report-orphaned-datasets branch June 12, 2025 00:44
jgallagher added a commit that referenced this pull request Jun 12, 2025
Builds on #8301.

Adds the orphaned datasets gathered by `sled-agent-config-reconciler` to
inventory, and adds an `omdb db inventory collections show $COLLECTION
orphaned-datasets` filter to show them.