
[sled-agent-config-reconciler] Report orphaned datasets (PR 1/2) #8301

Merged

Conversation

jgallagher
Contributor

During a config reconciliation pass, at the point where we ought to delete datasets (but don't yet - #6177), we now run a zfs get ... and scan for datasets that ought to be deleted. These are accumulated into an in-memory set that will be reportable via inventory (coming in the subsequent PR).
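
Very roughly, the new pass has this shape (a simplified sketch with hypothetical names and types, not the actual reconciler code):

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in; the real types live in sled-agent-config-reconciler.
type DatasetName = String;

#[derive(Default)]
struct OrphanedDatasetReport {
    // In-memory set of datasets we would have deleted (see omicron#6177);
    // the follow-up PR makes this reportable via inventory.
    orphans: BTreeSet<DatasetName>,
}

fn scan_for_orphans(
    on_disk: impl IntoIterator<Item = DatasetName>, // e.g. parsed `zfs get` output
    in_config: &BTreeSet<DatasetName>,
    report: &mut OrphanedDatasetReport,
) {
    for dataset in on_disk {
        // Any dataset present on disk but absent from the current sled config
        // is an orphan candidate; record it instead of deleting it for now.
        if !in_config.contains(&dataset) {
            report.orphans.insert(dataset);
        }
    }
}

fn main() {
    let in_config: BTreeSet<_> = ["oxp_pool/crypt/debug".to_string()].into();
    let mut report = OrphanedDatasetReport::default();
    scan_for_orphans(
        ["oxp_pool/crypt/debug".to_string(), "oxp_pool/crypt/stale".to_string()],
        &in_config,
        &mut report,
    );
    assert_eq!(report.orphans.len(), 1);
}
```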

@@ -942,8 +942,6 @@ pub enum DatasetKind {

    // Other datasets
    Debug,
    // Stores update artifacts (the "TUF Repo Depot")
    Update,
Collaborator

sweet, thanks for the cleanup


#[proptest]
fn parse_dataset_name(pool_id: [u8; 16], kind: DatasetKind) {
    let pool_id = ZpoolUuid::from_bytes(pool_id);
Collaborator

arguably this could be producing invalid UUIDv4s - with an arbitrary byte input sequence, the version/variant bits could be wrong - but I dunno if we care.

Contributor Author

I don't think we care (nor should we, presumably?).
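
(For reference, if we ever did care: the uuid crate can coerce arbitrary bytes into a structurally valid v4 UUID. A sketch, not something this PR does:)

```rust
use uuid::Builder;

fn main() {
    // Arbitrary proptest-style input; version/variant bits may be "wrong".
    let raw: [u8; 16] = [0xab; 16];
    // Builder::from_random_bytes forces version 4 and the RFC 4122 variant,
    // so the result is always a well-formed UUIDv4.
    let uuid = Builder::from_random_bytes(raw).into_uuid();
    assert_eq!(uuid.get_version_num(), 4);
    println!("{uuid}");
}
```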

Comment on lines +479 to +481
// We _don't_ need to check children of
// `{zpool}/{ZONE_DATASET}`: these are all transient, and
// should be destroyed/created on demand.
Collaborator

To be super explicit, what's deleting these datasets today? Is it the deletion of the zone?

Contributor Author

Nothing is explicitly deleting them today, although they do get cleaned up on a sled reboot; "should" is doing a lot of work here. We should also be destroying + recreating them if we stop / restart a zone, which I don't think is happening either. I'll file an issue or two for this.

Contributor Author

Filed #8316

Comment on lines +507 to +537
// Does this dataset have an ID set? If we created it, we expect it
// does, and we can easily check whether it still exists in our
// config. A dataset with a known `DatasetKind` without an ID set is
// pretty unexpected: we'll search through our config datasets to
// find a matching name; if we do, we'll assume it's that one.
let present_in_config = match properties.id {
    Some(id) => datasets.contains_key(&id),
    None => {
        if datasets.iter().any(|d| d.name == dataset) {
            warn!(
                self.log,
                "found on-disk dataset without an ID \
                 that matches a config dataset by name";
                "dataset" => dataset.full_name(),
            );
            true
        } else {
            warn!(
                self.log,
                "found on-disk dataset without an ID \
                 that doesn't match any config datasets; assuming \
                 it should be marked as an orphan";
                "dataset" => dataset.full_name(),
            );
            false
        }
    }
};
if present_in_config {
    continue;
}
Collaborator

Up to you, but this whole block could be marked as "for backwards compatibility" - we could remove it someday, once we're confident all datasets have UUIDs.

Contributor Author

I'm not sure we can guarantee that - isn't "create the dataset" a separate step from "set the oxide:uuid property on a dataset" (and therefore we could succeed in creating but fail in setting the ID)?

And maybe a step beyond - even if that's wrong and we can create+set ID atomically, I don't think we have a reasonable way to represent reading ZFS datasets where an ID is required (i.e., I'd expect zfs.get_dataset_properties() to always return an Option<DatasetUuid>, which we have to handle somehow).
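
(A sketch of the non-atomicity in question; purely illustrative, since the real code goes through illumos-utils rather than shelling out like this:)

```rust
use std::io;
use std::process::Command;

// Creating a dataset and recording its ID are two separate `zfs` invocations,
// so a crash or failure between them leaves a dataset with no oxide:uuid;
// that's why the ID read back from disk has to stay an Option.
fn create_dataset_with_id(name: &str, id: &str) -> io::Result<()> {
    // Step 1: create the dataset.
    let status = Command::new("zfs").args(["create", name]).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "zfs create failed"));
    }
    // Step 2: set the oxide:uuid user property. Failing here leaves the
    // dataset on disk without an ID.
    let prop = format!("oxide:uuid={id}");
    let status = Command::new("zfs").args(["set", prop.as_str(), name]).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "zfs set failed"));
    }
    Ok(())
}

fn main() {
    // Not actually invoked here; the sketch only exists to show the two steps.
    let _ = create_dataset_with_id;
}
```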

Collaborator

Yeah, good point. I suppose you're right: the lack of atomicity means this is possible (pretty sure you mentioned this in the sync today too, re: UUIDs embedded in names).

| DatasetKind::ClickhouseServer
| DatasetKind::ExternalDns
| DatasetKind::InternalDns => {
    // We should attempt to delete this; for now, just report it
Collaborator

I'm a little confused about the comments here - you're implying we "should delete" or "should refuse to delete", but this match statement just returns a "reason" with no label about whether deletion should proceed.

Is this a thing we're changing in a follow-up? Or a relic of a previous version?

Contributor Author

Changing in a follow-up: the reason for the dataset being orphaned in this branch is "we don't delete datasets yet (see omicron#6177)". Addressing #6177 therefore turns into replacing this with "try to delete the dataset, and only report an orphan reason if deletion fails".

The other dataset kinds will not be deleted as part of the follow-up work to address #6177.
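
(Roughly, the follow-up described above would reshape this along the following lines; a hypothetical sketch, not the actual follow-up code:)

```rust
// Hypothetical sketch of the follow-up: attempt deletion for the kinds we're
// allowed to delete, and only record an orphan reason if deletion fails.
#[allow(dead_code)]
enum OrphanReason {
    DeletionNotYetImplemented, // what this PR reports (omicron#6177)
    DeletionFailed(String),    // what the follow-up would report instead
}

fn try_remove_orphan(
    destroy: impl FnOnce() -> Result<(), String>,
) -> Option<OrphanReason> {
    match destroy() {
        // Destroyed successfully: nothing to report as an orphan.
        Ok(()) => None,
        // Destruction failed: keep reporting it, with the failure as the reason.
        Err(err) => Some(OrphanReason::DeletionFailed(err)),
    }
}

fn main() {
    let reason = try_remove_orphan(|| Err("zfs destroy failed".to_string()));
    assert!(matches!(reason, Some(OrphanReason::DeletionFailed(_))));
}
```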

Contributor Author

Will open PR 3 of 2 momentarily that addresses this.

jgallagher merged commit 89ce370 into main Jun 12, 2025
17 checks passed
jgallagher deleted the john/sled-agent-config-reconciler-report-orphaned-datasets branch June 12, 2025 00:44
jgallagher added a commit that referenced this pull request Jun 12, 2025
Builds on #8301.

Adds the orphaned datasets gathered by `sled-agent-config-reconciler` to
inventory, and adds an `omdb db inventory collections show $COLLECTION
orphaned-datasets` filter to show them.