Skip to content

fixes for ephemeral storage#822

Open
bcressey wants to merge 7 commits intobottlerocket-os:developfrom
bcressey:ephemeral-storage-fixes
Open

fixes for ephemeral storage#822
bcressey wants to merge 7 commits intobottlerocket-os:developfrom
bcressey:ephemeral-storage-fixes

Conversation

@bcressey
Copy link
Contributor

@bcressey bcressey commented Feb 4, 2026

Issue number:
N/A

Description of changes:
Refactor the implementation for the ephemeral storage API and fix four problems.

/mnt/ephemeral was previously mounted without the "private" option, leading to mount amplification as every directory under /var/lib/kubelet would appear twice. Parsing mount information can be a bottleneck for systemd on busy systems.

Mounting /mnt/ephemeral without the "private" option leads to another issue, which is that paths like /mnt/ephemeral/._var_lib_kubelet would show some of the same content as /var/lib/kubelet, but not all, because pod mounts will no longer propagate back to the parent. Rather than accept the inconsistency, we mount over the parent to "mask" it with a read-only mount and prevent it from being used as another way to view the content.

See also the notes in bottlerocket-os/bottlerocket#2240 from when the same change was made for /local and /var.

#395 included a fix to make the API idempotent, but #721 introduced a regression when initializing encrypted storage. Fix that in the same way, by removing the symlink if it already exists.

Bind mounts might not happen in the correct order - parent directories before children. Sort bind mounts first to ensure this.

Testing done:

Mounts no longer propagate back to /mnt/ephemeral

Before:

bash-5.2# findmnt -R /mnt -o TARGET,PROPAGATION,OPTIONS
TARGET                                                                                                          PROPAGATION OPTIONS
/mnt                                                                                                            shared      rw,nosuid,nodev,noatime,seclabel,attr2,inod
`-/mnt/.ephemeral                                                                                               shared      rw,relatime,seclabel,attr2,inode64,logbufs=
  |-/mnt/.ephemeral/._var_lib_kubelet/pods/5bf2ab9e-e6c6-4e37-b90f-874cafaafd4c/volumes/kubernetes.io~projected/kube-api-access-46q8s
  |                                                                                                             shared      rw,relatime,seclabel,size=409600k,noswap
  |-/mnt/.ephemeral/._var_lib_kubelet/pods/82eae8b0-5a7a-4b7c-9437-b5961c5cde14/volumes/kubernetes.io~projected/kube-api-access-gtnh9
  |                                                                                                             shared      rw,relatime,seclabel,size=6933428k,noswap
  |-/mnt/.ephemeral/._var_lib_kubelet/pods/0a20da32-a3aa-4617-9f98-bd2057450f51/volumes/kubernetes.io~projected/kube-api-access-j29rg

After:

bash-5.2#  findmnt -R /mnt -o TARGET,PROPAGATION,OPTIONS
TARGET                                PROPAGATION OPTIONS                                                                                                              /mnt                                  shared      rw,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=8,swidth=8,noquota
`-/mnt/.ephemeral                     private     rw,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
  |-/mnt/.ephemeral/._var_lib_kubelet private     ro,nosuid,nodev,noexec,relatime,seclabel,user_xattr,acl,cache_strategy=readaround                                      `-/mnt/.ephemeral/._var_log_pods    private     ro,nosuid,nodev,noexec,relatime,seclabel,user_xattr,acl,cache_strategy=readaround
Mounts have expected options

Before:

bash-5.2# findmnt /var/lib/kubelet/ -o TARGET,OPTIONS
TARGET           OPTIONS
/var/lib/kubelet rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota

After:

bash-5.2# findmnt /var/lib/kubelet/ -o TARGET,OPTIONS
TARGET           OPTIONS
/var/lib/kubelet rw,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
Init API is idempotent

Before:

bash-5.2# apiclient ephemeral-storage init
Failed to initialize ephemeral storage: Failed POST request to '/actions/ephemeral-storage/init': Status 500 when POSTing /actions/ephemeral-storage/init: Unable to initialize ephemeral storage: Failed to create disk symlink File exists (os error 17)

After (later boot phase, expected failure):

bash-5.2# apiclient ephemeral-storage init
Failed to initialize ephemeral storage: Failed POST request to '/actions/ephemeral-storage/init': Status 500 when POSTing /actions/ephemeral-storage/init: Unable to initialize ephemeral storage: Failed to run 'rottweiler' with args 'attach block-device /dev/disk/EPHEMERAL-DATA ephemeral-storage' on device '/dev/disk/EPHEMERAL-DATA': stdout: , stderr: Error: /usr/bin/systemd-creds failed with exit code 1: Failed to unseal secret using TPM2: Operation not permitted

After (same boot phase, success):

bash-5.3# apiclient ephemeral-storage init
bash-5.3# apiclient ephemeral-storage init
Parent directories mounted before children

Before:

bash-5.3# apiclient ephemeral-storage bind --dirs /mnt/foo/bar /mnt/foo
Failed to initialize ephemeral storage: Failed POST request to '/actions/ephemeral-storage/bind': Status 500 when POSTing /actions/ephemeral-storage/bind: Unable to bind ephemeral storage: Failed to bind directory '/mnt/.ephemeral/._mnt_foo_bar' to '/mnt/foo/bar': mount: /mnt/foo/bar: mount point does not exist.

After:

bash-5.3# apiclient ephemeral-storage bind --dirs /mnt/foo/bar/baz /mnt/foo /mnt/bar
bash-5.3# findmnt -R /mnt -o TARGET,SOURCE
TARGET                                SOURCE
/mnt                                  /dev/mapper/BOTTLEROCKET-DATA[/mnt]
├─/mnt/.ephemeral                     /dev/mapper/EPHEMERAL-DATA
│ ├─/mnt/.ephemeral/._mnt_foo         /dev/dm-0[/srv]
│ ├─/mnt/.ephemeral/._mnt_bar         /dev/dm-0[/srv]
│ └─/mnt/.ephemeral/._mnt_foo_bar_baz /dev/dm-0[/srv]
├─/mnt/bar                            /dev/mapper/EPHEMERAL-DATA[/._mnt_bar]
└─/mnt/foo                            /dev/mapper/EPHEMERAL-DATA[/._mnt_foo]
  └─/mnt/foo/bar/baz                  /dev/mapper/EPHEMERAL-DATA[/._mnt_foo_bar_baz]

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Add additional context to errors, and avoid unnecessary conversions
at call sites.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
Make "/mnt/.ephemeral" a constant and avoid unnecessary conversions.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
Separate the logic for bind mounts into two phases: first, determine
which bind mounts are required; second, perform the mounts.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
Mount ephemeral storage with the same options used for `/local` on
the host, to ensure that similar restrictions are applied regardless
of the backing store for directories like `/var/lib/containerd`.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
Add an additional phase that "masks" the underlying directories for
bind mounts on ephemeral storage by over-mounting them with `/srv`.

This suppresses the near-duplicate entries in the host's mount table,
where mounts under `/var/lib/kubelet` would also be listed under
`/mnt/ephemeral/._var_lib_kubelet`.

This mimics the masking done for `/mnt`, `/opt`, and `/var` on the
host's data partition.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
Similar to the idempotency fix in f75240e ("apiserver: overwrite
symlink in ephemeral storage init") but handles the encrypted storage
case where there's an additional symlink for state-tracking purposes.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
@bcressey bcressey requested review from KCSesh and ytsssun February 4, 2026 22:05
Comment on lines +537 to +539
if std::fs::exists(EPHEMERAL_DATA_LINK).is_ok_and(|x| x) {
std::fs::remove_file(EPHEMERAL_DATA_LINK).context(error::DiskUnlinkFailureSnafu {})?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q - for the idempotency problem, I wonder if we can introduce any testing to avoid future regression. Unit test might be hard as we might need to introduce a series of mocking to be able to test functions like encrypt_ephemeral_device.

Comment on lines +261 to +273
let is_source_masked = is_masked(&source_dir)?;
let is_target_mounted = is_mounted(target_dir)?;

match (is_target_mounted, is_source_masked) {
(true, true) => continue,
(true, false) => {
dirs_to_mask.insert(source_dir.into());
}
(false, false) => {
dirs_to_bind.insert((source_dir.clone(), target_dir.clone()));
dirs_to_mask.insert(source_dir.into_os_string());
}
(false, true) => error::DirectoryAlreadyMaskedSnafu { dir: source_dir }.fail()?,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we wrap this logic in a modular function and unit test them? Right now they are not quite testable if included as part of bind().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we wrap this logic in a modular function and unit test them? Right now they are not quite testable if included as part of bind().

Making this unit testable is a good idea, but it's a larger refactor than I want to tackle here. Are you OK with a backlog issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, this is not a blocking comment!

Comment on lines 246 to 247
let mut dirs_to_bind = HashSet::new();
let mut dirs_to_mask = HashSet::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q - does the order matter here for the dirs to be binded/masked? If yes, we should probably use Vec<> instead.

Copy link
Contributor

@KCSesh KCSesh Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem I can see with this is if mount points are children of another.

Which makes me wonder about adding additional input sanity on overlapping paths and if we want to protect against that.

/mnt/test and /mnt/test/sub - technically with a hashset mount order is non-deterministic, which with overlapping paths could make data inaccessible.

We could catch that as invalid 🤔

Details
for (i, dir_a) in dirs.iter().enumerate() {
    let path_a = Path::new(dir_a);
    for dir_b in dirs.iter().skip(i + 1) {
        let path_b = Path::new(dir_b);
        if path_a.starts_with(path_b) || path_b.starts_with(path_a) {
            return error::InvalidParameterSnafu {
                parameter: format!("{}, {}", dir_a, dir_b),
                reason: "overlapping bind directories not allowed",
            }
            .fail();
        }
    }
}

This problem exists today though - but across multiple calls:

apiclient ephemeral-storage bind --targets /mnt/a/sub
# later...
apiclient ephemeral-storage bind --targets /mnt/a

I could see this happening across multiple bootstrap containers that run and set different binds (where we don't guarantee launch order)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordering matters for bind mounts, but not for the masking part since masking happens on ._path_to_dir.

Should be easy to fix the mount ordering, e.g. by sorting, I can add that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see this happening across multiple bootstrap containers that run and set different binds (where we don't guarantee launch order)

Fixing this - or at least detecting it - across multiple calls is a reasonable idea, but will have to wait for a future change 😀

I can fix it for the single call case though.

/// binds the specified directories to the pre-configured array, creating those directories if
/// Binds the specified directories to the pre-configured array, creating those directories if
/// they do not exist.
pub fn bind(variant: &str, dirs: Vec<String>) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: With more logic adding to bind() it has become a rather large function. To maintain readability, should we consider modularize some of the new code added here?

Ensure that `/mnt/parent` is mounted before `/mnt/parent/child`, as
otherwise the parent's storage may be mounted over the child's, and
files could be stored in the wrong location.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants