Conversation

@smklein (Collaborator) commented Mar 27, 2025

Partial fix of #7874 and #4203

The debug dataset has not been mounted post-reboot due to logic issues outlined in #7874 (comment). This PR resolves those issues, so that the sled agent attempts to mount non-zoned datasets.

Note: This PR makes no attempt to delete existing crypt/debug directories; if files already exist under the desired mountpoint, they are left in place.
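
For context, the condition at the heart of the fix reduces to a small check. A minimal sketch, with a hypothetical DatasetKind standing in for sled-storage's real types:

// Hypothetical stand-in kinds; only the zoned/non-zoned split matters here.
enum DatasetKind {
    ZoneDataset, // zoned: delegated to, and mounted inside, a zone
    Debug,       // non-zoned: e.g. crypt/debug, mounted by the sled agent
}

impl DatasetKind {
    fn zoned(&self) -> bool {
        matches!(self, DatasetKind::ZoneDataset)
    }
}

/// An unmounted, non-zoned dataset is unusable until the sled agent
/// itself mounts it, so this is the case the fix has to catch.
fn needs_sled_agent_mount(kind: &DatasetKind, mounted: bool) -> bool {
    !kind.zoned() && !mounted
}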

@leftwo (Contributor) left a comment

When we find a mount point that we can't mount our dataset on, we see a message like this:

23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
    err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
    file = sled-storage/src/resources.rs:445

Should that be an ERRO and not a WARN?
It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, that could include other zones we needed.

logctx.cleanup_successful();
}

async fn is_mounted(dataset: &DatasetName) -> bool {
A Contributor commented:

I was worried about the unwraps here, but it looks like this is a test-only function.
Would it make sense to mark it #[cfg(test)]?

Same with unmount() below.

@smklein (Collaborator, Author) replied:

This is already inside a mod test block, which is itself compiled with #[cfg(test)].
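
For illustration, a sketch of what a test-only helper like this can look like (simplified to take a plain &str rather than a DatasetName):

#[cfg(test)]
mod test {
    // Compiled only for tests, so the unwrap()s can never panic in
    // production code paths.
    async fn is_mounted(dataset: &str) -> bool {
        let output = tokio::process::Command::new("/usr/sbin/zfs")
            .args(["get", "-H", "-o", "value", "mounted", dataset])
            .output()
            .await
            .unwrap();
        assert!(output.status.success());
        String::from_utf8_lossy(&output.stdout).trim() == "yes"
    }
}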

}

if !config.name.kind().zoned() && !old_dataset.mounted {
info!(
A Contributor commented:

When would we hit this (I don't see it in the error case I tested)?
Should this be a warning? It seems like if we hit this we just ignore it.

@smklein (Collaborator, Author) replied:

This is called when the "omicron-config" PUT endpoint is hit, via datasets_ensure -> dataset_ensure.

Basically: If someone tries to ask for a dataset, but it is not mounted, we'll try to ensure that it gets mounted.

This is necessary for my test below, where I do the following operations:

  • Ask for a dataset to exist
  • Unmount that dataset, manually
  • Ask for that dataset to exist once more
  • Observe that it becomes mounted

Is this likely in practice? No. But it's one case of "the config you asked for hasn't been 100% applied, so we should try to make it happen".
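
That flow, sketched directly against the zfs CLI (the real test drives sled-agent's datasets_ensure API instead; the dataset name is a placeholder, assumed to exist and to start unmounted):

use std::process::Command;

fn zfs(args: &[&str]) -> bool {
    Command::new("/usr/sbin/zfs")
        .args(args)
        .status()
        .expect("failed to run zfs")
        .success()
}

fn main() {
    let dataset = "oxp_example/crypt/debug"; // placeholder name

    // 1. Ask for the dataset to exist: mount it, loading keys with -l.
    assert!(zfs(&["mount", "-l", dataset]));
    // 2. Unmount that dataset, manually.
    assert!(zfs(&["unmount", dataset]));
    // 3. Ask for it once more; the ensure logic should remount it.
    assert!(zfs(&["mount", "-l", dataset]));
    // 4. Observe that it is mounted again.
    let out = Command::new("/usr/sbin/zfs")
        .args(["get", "-H", "-o", "value", "mounted", dataset])
        .output()
        .expect("failed to run zfs");
    assert_eq!(String::from_utf8_lossy(&out.stdout).trim(), "yes");
}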

A Contributor replied:

If it makes no difference, then I would at least throw a warning or an error. If it ever does end up being called in production, then we would have a better chance of seeing it.

A Contributor commented:

I think there are a couple ways we could see this in production (one theoretical, one actual but unlikely). Both are related to the fact that we currently don't read the ledgered DatasetConfig on sled-agent startup, so as Sean said above we only enter this codepath in response to a PUT /omicron-config. (Not reading the ledgered dataset config is also the root cause of #7546, I believe.)

During a cold boot, sled-agent will try to mount:

  • The crypt dataset
  • The crypt/debug dataset

and then as it starts up each zone, it will mount each zone's durable dataset (if it has one) and transient root dataset (always).

Back to the two ways we could go down this code path:

  1. (Entirely theoretical) If the DatasetConfig we get from Nexus contained datasets other than crypt, crypt/debug, or a dataset related to a zone, we'd try to mount it here. I don't believe DatasetConfig has any such datasets, though.
  2. (Practical but unlikely) If we got a PUT /omicron-config after sled-agent started accepting HTTP requests but before it had launched all its zones, we'd go through this codepath to mount any zone-related datasets that hadn't been mounted yet.

#7884 is also related: sled-agent won't mount support bundle datasets after a cold boot, but none of the above about reading the ledgered DatasetConfig will help with that, because the support bundle datasets aren't in that config.

TL;DR: We could see this in production in some weird cases. I think those cases are related to preexisting bugs (i.e., not reading DatasetConfig on launch and related fallout from that). These aren't particularly bad, and this PR doesn't make them any worse. We have ongoing work to address them, but that work shouldn't be an R14 blocker.

@smklein (Collaborator, Author) commented Mar 31, 2025

> When we find a mount point that we can't mount our dataset on, we see a message like this:
>
> 23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
>     err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
>     file = sled-storage/src/resources.rs:445
>
> Should that be an ERRO and not a WARN? It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, that could include other zones we needed.

I could go either way on this - it is technically not fatal for the sled agent, but it does lead to us not using the disk?

Regardless, I'm happy to make this an "error" to make it more visible in the logs?

@leftwo (Contributor) commented Mar 31, 2025

> > When we find a mount point that we can't mount our dataset on, we see a message like this:
> >
> > 23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
> >     err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
> >     file = sled-storage/src/resources.rs:445
> >
> > Should that be an ERRO and not a WARN? It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, that could include other zones we needed.
>
> I could go either way on this - it is technically not fatal for the sled agent, but it does lead to us not using the disk?
>
> Regardless, I'm happy to make this an "error" to make it more visible in the logs?

Yeah, it's fatal for the overall control plane if there are other things on this disk that would have been mounted, but not for sled agent specifically.

I wanted it as an error to make it easier to find for the poor soul debugging this. It might not be obvious what the problem is at first glance, and having an ERRO in the logs might get some attention.

@smklein (Collaborator, Author) commented Mar 31, 2025

> I wanted it as an error to make it easier to find for the poor soul debugging this. It might not be obvious what the problem is at first glance, and having an ERRO in the logs might get some attention.

Done in a3a7215
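
Presumably the change amounts to swapping the log macro at the failure site; a slog-style sketch matching the excerpt's format (not the literal diff):

use slog::{error, warn, Logger};

// Before: the mount failure surfaced at WARN, easy to miss.
fn report_before(log: &Logger, err: &str) {
    warn!(log, "Disk::new failed"; "err" => err);
}

// After a3a7215 (presumably): same message at ERRO, so it stands out
// to whoever is digging through the logs.
fn report_after(log: &Logger, err: &str) {
    error!(log, "Disk::new failed"; "err" => err);
}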

@leftwo (Contributor) left a comment

I do think we should line this up with #7888 and have them land in quick succession.
And also be sure that we know what to do on dogfood (and colo, and customers) so we are ready when these changes land on those racks.

@smklein merged commit 58f4fc3 into main Apr 3, 2025 (16 checks passed)
@smklein deleted the do-more-mounting branch April 3, 2025 20:38
smklein added a commit that referenced this pull request Apr 10, 2025
Related to #7887 - we have seen that datasets nested within encrypted datasets are not automatically mounted on reboot.

To mitigate this issue: whenever we perform an operation which attempts to read the dataset, ensure that it is mounted. To do this operation, some minor refactors were made to mounting-related code.

Fixes #7884
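
A sketch of that "ensure mounted before reading" shape, using the zfs CLI as a stand-in for sled-agent's internal ZFS helpers:

use std::io;
use std::process::Command;

/// Mount `dataset` if needed, then read `path` from it. `zfs mount -l`
/// also loads the encryption key, which matters for datasets nested
/// under crypt/.
fn read_from_dataset(dataset: &str, path: &str) -> io::Result<Vec<u8>> {
    let mounted = Command::new("/usr/sbin/zfs")
        .args(["get", "-H", "-o", "value", "mounted", dataset])
        .output()?;
    if String::from_utf8_lossy(&mounted.stdout).trim() != "yes" {
        let status = Command::new("/usr/sbin/zfs")
            .args(["mount", "-l", dataset])
            .status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("cannot mount {dataset}"),
            ));
        }
    }
    std::fs::read(path)
}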