
support recursive read only (rro) option for mounts #25680


Open
wants to merge 1 commit into base: main

Conversation

danishprakash
Contributor

`ro` on Linux kernels < 5.12 is not applied recursively (mount_setattr(2) with AT_RECURSIVE is only available since 5.12), leading to submounts not respecting the read-only option.

crun/runc achieve recursive read-only by calling mount_setattr(2) when passed the `rro` mount option. This change checks whether the runtime in use supports `rro` and whether the kernel supports mount_setattr(2), and passes `rro` to the runtime. It defaults to `rro` when supported and `ro` is specified. If `rro` is explicitly specified but unsupported, it returns an error.

Closes #24229

Signed-off-by: Danish Prakash <contact@danishpraka.sh>

Contributor

openshift-ci bot commented Mar 25, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None label Mar 25, 2025
Contributor

openshift-ci bot commented Mar 25, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danishprakash
Once this PR has been reviewed and has the lgtm label, please assign mtrmac for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Cockpit tests failed for commit 87c34b1. @martinpitt, @jelly, @mvollmer please check.

Member

@Luap99 Luap99 left a comment


Thanks for working on this. I'd like to see some more discussion around the feature check; I at least have performance concerns with such an implementation.

cc @giuseppe @mheon @baude

Comment on lines +136 to +140
features, err := runtime.getOCIRuntimeFeatures()
if err != nil {
	return nil, fmt.Errorf("getting %s features: %w", runtime.name, err)
}
runtime.features = features
Member

IMO this is not acceptable for performance reasons. Every podman command would then have to run this, even though the vast majority of them will never need it.
I know the features command seems very fast, but it is still a total overhead of a few ms, and we have users on embedded systems who are much more concerned about every small thing.

Second, how many runtimes actually support the features command? Right now this is a hard error, which means it can cause regressions.

I see three options:

  • don't do any feature checks and assume rro is always supported (that means we would need to bump the kernel baseline and document which runtimes support it)
  • wrap this into a sync.OnceValues() function so it is only called once, and only when needed
  • add a compile-time option for podman to specify whether rro should be used; distros that ship podman with a new kernel and runtime then don't have to pay the feature-check overhead

Member

We have also, in the past, added fields to containers.conf to indicate which runtimes supported certain features - see supportsNoCgroups for example. Adding to containers.conf is not perfect (it shifts the burden to the packager to make sure their combination of kernel + runtime version supports RRO) but it is very performant.
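Such a containers.conf field might look like this (the `runtime_supports_rro` key is hypothetical, mirroring the existing `runtime_supports_nocgroups` list):

```toml
[engine]
# Hypothetical: runtimes the packager asserts support recursive read-only
# mounts, analogous to runtime_supports_nocgroups / runtime_supports_kvm.
runtime_supports_rro = ["crun", "runc"]
```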

Member

Alternatively, I wonder if we could lazy-initialize this

Member

ro is a common option, so we will always end up paying the price of running $OCI_RUNTIME features on each invocation. There is also the risk, with podman system service, that we will not detect when the OCI runtime was updated.

IMO we could cache the result under the tmpdir using the runtime path as the key, and then use the runtime binary's stat information (st_dev, st_ino and mtime) to detect when the cache is out of date.

What do you think?

Member

Alternatively, I wonder if we could lazy-initialize this

That is my suggestion 2, using sync.OnceValues()

ro is a common option, so we will always end up paying the price of running $OCI_RUNTIME features on each invocation

How so? Any command that does not start a container will certainly never need it: podman stop, podman rm, podman ps, podman logs, etc. The way I look at it, only a minority of commands will actually need this.

IMO we could cache the result under the tmpdir using the runtime path as the key, and then use the runtime binary stat (st_dev, st_ino and mtime) information to detect when the cache is out of date.

Yes, that sounds good to me. There is an overhead for the stat and for reading the file, but compared to everything else we do that is likely not a problem, and it should be much better than spawning a new process each time.

Even worse: reading the code again, we actually call newConmonOCIRuntime() on all runtimes from containers.conf, so we pay the overhead multiple times, for all runtimes, which really doesn't feel right.

Member

ro is a common option, so we will always end up paying the price of running $OCI_RUNTIME features on each invocation

How so? Any command that does not start a container will certainly never need it: podman stop, podman rm, podman ps, podman logs, etc. The way I look at it, only a minority of commands will actually need this.

I was referring only to the commands that run a container, and that is where we care most about performance. sync.OnceValues() won't work with system service, as we need to check that the runtime didn't change.

On my system I have:

➜ hyperfine 'crun features'
Benchmark 1: crun features
  Time (mean ± σ):       2.3 ms ±   0.6 ms    [User: 0.8 ms, System: 1.6 ms]
  Range (min … max):     1.6 ms …   6.2 ms    553 runs

a stat() is much faster than that. If we really care about performance, we could even consider statx() and query only the information we need. Most of the price will come from dealing with JSON anyway, so maybe we want to store the information in a different format and store only the bits we care about.

Member

Most of the price will come from dealing with json anyway, so maybe we want to store the information in a different format and store only the bits we care about.

Sounds like a good idea.

Contributor Author

(sorry for the multi-week delay in responding)


wrap this into a sync.OnceValues() function so it is only called once, and only when needed

I prefer lazy-initializing the runtime features over the other two options suggested by @Luap99 as well. As for caching the output under the tmpdir, I'm under the assumption, as @giuseppe pointed out, that it's to avoid the situation where the runtime changes. It shouldn't make too much of a dent in terms of performance.

errKernelDoesNotSupportRRO = errors.New("kernel does not support recursive readonly mount option `rro`")
errRuntimeDoesNotSupportRRO = errors.New("runtime does not support recursive readonly mount option `rro`")

kernelSupportsRROOnce sync.Once
Member

new code should use sync.OnceValue(s) instead

@@ -194,6 +195,11 @@ func (r *MissingRuntime) SupportsKVM() bool {
return false
}

// Features returns nil since this is a missing runtime
func (r *MissingRuntime) Features() *features.Features {
return nil
Member

This should still return an error, so I recommend (*features.Features, error) - the missing runtime can have methods called on it, and we want to avoid nil pointer panics.

@@ -374,6 +374,27 @@ func (c *Container) generateSpec(ctx context.Context) (s *spec.Spec, cleanupFunc
// Podman decided for --no-dereference as many
// bin-utils tools (e.g., touch, chown, cp) do.
options = append(options, "copy-symlink")
// TODO: this also ends up checking non-user mounts
case "ro", "rro":
Member

Why check rro here? It seems like we should always pass it along unmodified, even if unsupported - if the user explicitly requested rro and we silently downgrade, that seems like a problem.

Member

yes, rro must be passed as-is.

Also, we need to have a way to restore the previous behavior. If a user really wants plain ro, there is no way to achieve it if we always force ro -> rro.

Contributor Author

Why check rro here? It seems like we should always pass it along unmodified, even if unsupported - if the user explicitly requested rro and we silently downgrade, that seems like a problem.

A downgrade shouldn't happen. The intent of the rro check here is to upgrade ro to rro when supported, and to error out when a user explicitly requests rro and it isn't supported.

Contributor Author

Also we need to have a way to restore the previous behavior. If a user really wants a ro, there is no way to achieve it when we always force ro -> rro

But why? I can understand it for backwards compatibility, but that is wrong from a security point of view. I don't know how bad the impact would be when we roll out this change, but we could introduce a non-recursive option (norro) as part of the transition until the buggy ro behaviour fades out. Wdyt?

@martinpitt
Contributor

The cockpit-podman failure is interesting and relevant. /tmp/ro is mounted read-only, and yet the API has volume.RW == true. (c-podman doesn't know about/test rro yet, of course)

@Luap99
Member

Luap99 commented Mar 26, 2025

Another thing I just thought of: this does a kernel feature check, but is that actually correct? We might have VM-based runtimes such as krun; with VM runtimes the mount options would be applied by the VM kernel, not the host kernel, right? Not sure if this is a real concern.

Comment on lines +273 to +297
tmpMnt, err := os.MkdirTemp("", "podman-detect-rro")
if err != nil {
	return fmt.Errorf("failed to create a temp directory: %w", err)
}
for {
	err = unix.Mount("", tmpMnt, "tmpfs", 0, "")
	if !errors.Is(err, unix.EINTR) {
		break
	}
}
if err != nil {
	return fmt.Errorf("failed to mount tmpfs on %q: %w", tmpMnt, err)
}
defer func() {
	var umErr error
	for {
		umErr = unix.Unmount(tmpMnt, 0)
		if !errors.Is(umErr, unix.EINTR) {
			break
		}
	}
	if umErr != nil {
		logrus.Errorf("Failed to unmount %q: %v", tmpMnt, umErr)
	}
}()
Member

please use open_tree or fsopen so we don't have to create a mount that could be leaked to the host.

Comment on lines +314 to +317
kernelSupportsRROOnce.Do(func() {
	errKernelDoesNotSupportRRO = fn()
})
return errKernelDoesNotSupportRRO
Contributor

You can simplify this by using sync.OnceFunc.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2025
@openshift-merge-robot
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Development

Successfully merging this pull request may close these issues.

Support recursive read-only bind mounts (kernel >= 5.12)
7 participants