fixes for ephemeral storage#822
Conversation
Add additional context to errors, and avoid unnecessary conversions at call sites. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Make "/mnt/.ephemeral" a constant and avoid unnecessary conversions. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Separate the logic for bind mounts into two phases: first, determine which bind mounts are required; second, perform the mounts. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Mount ephemeral storage with the same options used for `/local` on the host, to ensure that similar restrictions are applied regardless of the backing store for directories like `/var/lib/containerd`. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Add an additional phase that "masks" the underlying directories for bind mounts on ephemeral storage by over-mounting them with `/srv`. This suppresses the near-duplicate entries in the host's mount table, where mounts under `/var/lib/kubelet` would also be listed under `/mnt/ephemeral/._var_lib_kubelet`. This mimics the masking done for `/mnt`, `/opt`, and `/var` on the host's data partition. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Similar to the idempotency fix in f75240e ("apiserver: overwrite symlink in ephemeral storage init") but handles the encrypted storage case where there's an additional symlink for state-tracking purposes. Signed-off-by: Ben Cressey <bcressey@amazon.com>
| if std::fs::exists(EPHEMERAL_DATA_LINK).is_ok_and(|x| x) { | ||
| std::fs::remove_file(EPHEMERAL_DATA_LINK).context(error::DiskUnlinkFailureSnafu {})?; | ||
| } |
There was a problem hiding this comment.
Q - for the idempotency problem, I wonder if we can introduce any testing to avoid future regression. Unit test might be hard as we might need to introduce a series of mocking to be able to test functions like encrypt_ephemeral_device.
| let is_source_masked = is_masked(&source_dir)?; | ||
| let is_target_mounted = is_mounted(target_dir)?; | ||
|
|
||
| match (is_target_mounted, is_source_masked) { | ||
| (true, true) => continue, | ||
| (true, false) => { | ||
| dirs_to_mask.insert(source_dir.into()); | ||
| } | ||
| (false, false) => { | ||
| dirs_to_bind.insert((source_dir.clone(), target_dir.clone())); | ||
| dirs_to_mask.insert(source_dir.into_os_string()); | ||
| } | ||
| (false, true) => error::DirectoryAlreadyMaskedSnafu { dir: source_dir }.fail()?, |
There was a problem hiding this comment.
Can we wrap this logic in a modular function and unit test them? Right now they are not quite testable if included as part of bind().
There was a problem hiding this comment.
Can we wrap this logic in a modular function and unit test them? Right now they are not quite testable if included as part of bind().
Making this unit testable is a good idea, but it's a larger refactor than I want to tackle here. Are you OK with a backlog issue?
There was a problem hiding this comment.
sure, this is not a blocking comment!
| let mut dirs_to_bind = HashSet::new(); | ||
| let mut dirs_to_mask = HashSet::new(); |
There was a problem hiding this comment.
Q - does the order matter here for the dirs to be binded/masked? If yes, we should probably use Vec<> instead.
There was a problem hiding this comment.
The only problem I can see with this is if mount points are children of another.
Which makes me wonder about adding additional input sanity on overlapping paths and if we want to protect against that.
/mnt/test and /mnt/test/sub - technically with a hashset mount order is non-deterministic, which with overlapping paths could make data inaccessible.
We could catch that as invalid 🤔
Details
for (i, dir_a) in dirs.iter().enumerate() {
let path_a = Path::new(dir_a);
for dir_b in dirs.iter().skip(i + 1) {
let path_b = Path::new(dir_b);
if path_a.starts_with(path_b) || path_b.starts_with(path_a) {
return error::InvalidParameterSnafu {
parameter: format!("{}, {}", dir_a, dir_b),
reason: "overlapping bind directories not allowed",
}
.fail();
}
}
}
This problem exists today though - but across multiple calls:
apiclient ephemeral-storage bind --targets /mnt/a/sub
# later...
apiclient ephemeral-storage bind --targets /mnt/a
I could see this happening across multiple bootstrap containers that run and set different binds (where we don't guarantee launch order)
There was a problem hiding this comment.
Ordering matters for bind mounts, but not for the masking part since masking happens on ._path_to_dir.
Should be easy to fix the mount ordering, e.g. by sorting, I can add that.
There was a problem hiding this comment.
I could see this happening across multiple bootstrap containers that run and set different binds (where we don't guarantee launch order)
Fixing this - or at least detecting it - across multiple calls is a reasonable idea, but will have to wait for a future change 😀
I can fix it for the single call case though.
| /// binds the specified directories to the pre-configured array, creating those directories if | ||
| /// Binds the specified directories to the pre-configured array, creating those directories if | ||
| /// they do not exist. | ||
| pub fn bind(variant: &str, dirs: Vec<String>) -> Result<()> { |
There was a problem hiding this comment.
nit: With more logic adding to bind() it has become a rather large function. To maintain readability, should we consider modularize some of the new code added here?
Ensure that `/mnt/parent` is mounted before `/mnt/parent/child`, as otherwise the parent's storage may be mounted over the child's, and files could be stored in the wrong location. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Issue number:
N/A
Description of changes:
Refactor the implementation for the ephemeral storage API and fix four problems.
/mnt/ephemeralwas previously mounted without the "private" option, leading to mount amplification as every directory under/var/lib/kubeletwould appear twice. Parsing mount information can be a bottleneck forsystemdon busy systems.Mounting
/mnt/ephemeralwithout the "private" option leads to another issue, which is that paths like/mnt/ephemeral/._var_lib_kubeletwould show some of the same content as/var/lib/kubelet, but not all, because pod mounts will no longer propagate back to the parent. Rather than accept the inconsistency, we mount over the parent to "mask" it with a read-only mount and prevent it from being used as another way to view the content.See also the notes in bottlerocket-os/bottlerocket#2240 from when the same change was made for
/localand/var.#395 included a fix to make the API idempotent, but #721 introduced a regression when initializing encrypted storage. Fix that in the same way, by removing the symlink if it already exists.
Bind mounts might not happen in the correct order - parent directories before children. Sort bind mounts first to ensure this.
Testing done:
Mounts no longer propagate back to /mnt/ephemeral
Before:
After:
Mounts have expected options
Before:
After:
Init API is idempotent
Before:
After (later boot phase, expected failure):
After (same boot phase, success):
Parent directories mounted before children
Before:
After:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.