support recursive read only (rro) option for mounts #25680
base: main
Conversation
`ro` on Linux kernels < 5.12 doesn't apply MT_RECURSIVE, leading to submounts not respecting the read-only option. crun/runc achieve recursive read-only by calling mount_setattr(2) when passed the `rro` mount option. This change checks whether the runtime in use supports `rro` and the kernel supports mount_setattr(2), and passes `rro` to the runtime. It defaults to `rro` when supported and `ro` is specified. If `rro` is specified and unsupported, it returns an error.

Closes #24229

Signed-off-by: Danish Prakash <contact@danishpraka.sh>
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: danishprakash. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Cockpit tests failed for commit 87c34b1. @martinpitt, @jelly, @mvollmer please check.
```go
features, err := runtime.getOCIRuntimeFeatures()
if err != nil {
	return nil, fmt.Errorf("getting %s features: %w", runtime.name, err)
}
runtime.features = features
```
IMO this is not acceptable for performance reasons. Every podman command would then have to run this, even though the vast majority of them will never need it.
I know the features command seems very fast, but it is still a total overhead of a few ms, and we have users on embedded systems who are much more concerned about every small thing.
Second, how many runtimes actually support the features command? Right now this is a hard error, which means it can cause regressions.
I see three options:
- don't do any feature checks and assume rro is always supported (that means we would need to bump the kernel baseline and document which runtimes support it)
- wrap this into a sync.OnceValues() function so it is only called once and only when needed (see the sketch after this list)
- add a compile-time option for podman to specify if rro should be used. Distros that ship podman with a new kernel and runtime then don't have to pay the feature check overhead.
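To make the second option concrete, here is a minimal sketch of what a lazy features lookup could look like. The `RuntimeFeatures` struct, `SupportsRRO` helper, and the exec of `<runtime> features` are illustrative stand-ins, not the PR's actual code:

```go
package rrosketch

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"sync"
)

// RuntimeFeatures is a stand-in for the runtime-spec features document; the
// real code would use features.Features from the runtime-spec bindings.
type RuntimeFeatures struct {
	MountOptions []string `json:"mountOptions"`
}

type ociRuntime struct {
	path     string
	features func() (*RuntimeFeatures, error) // populated lazily, at most once
}

func newOCIRuntime(path string) *ociRuntime {
	r := &ociRuntime{path: path}
	// sync.OnceValues (Go 1.21+) runs the closure only on the first call to
	// r.features() and caches both return values for every later caller.
	r.features = sync.OnceValues(func() (*RuntimeFeatures, error) {
		out, err := exec.Command(path, "features").Output()
		if err != nil {
			return nil, fmt.Errorf("getting %s features: %w", path, err)
		}
		var f RuntimeFeatures
		if err := json.Unmarshal(out, &f); err != nil {
			return nil, err
		}
		return &f, nil
	})
	return r
}

// SupportsRRO only triggers the exec for callers that actually need it;
// commands like stop/rm/ps/logs never pay the cost.
func (r *ociRuntime) SupportsRRO() (bool, error) {
	f, err := r.features()
	if err != nil {
		return false, err
	}
	for _, o := range f.MountOptions {
		if o == "rro" {
			return true, nil
		}
	}
	return false, nil
}
```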
We have also, in the past, added fields to containers.conf to indicate which runtimes supported certain features - see supportsNoCgroups for example. Adding to containers.conf is not perfect (it shifts the burden to the packager to make sure their combination of kernel + runtime version supports RRO), but it is very performant.
Alternatively, I wonder if we could lazy-initialize this
`ro` is a common option, so we will always end up with the price of running $OCI_RUNTIME features on each invocation. Also there is the risk with podman system service that we will not detect when the OCI runtime was updated.
IMO we could cache the result under the tmpdir using the runtime path as the key, and then use the runtime binary stat (st_dev, st_ino and mtime) information to detect when the cache is out of date.
What do you think?
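As a rough illustration of that caching idea (all names are hypothetical; the real code would live in libpod and reuse its tmpdir handling):

```go
package rrocache

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// runtimeStamp captures the identity of the runtime binary; if any field
// changes, the cached `features` output is treated as stale.
type runtimeStamp struct {
	Dev      uint64 `json:"dev"`
	Ino      uint64 `json:"ino"`
	MtimeSec int64  `json:"mtimeSec"`
	MtimeNs  int64  `json:"mtimeNs"`
}

func stampOf(runtimePath string) (runtimeStamp, error) {
	var st unix.Stat_t
	if err := unix.Stat(runtimePath, &st); err != nil {
		return runtimeStamp{}, fmt.Errorf("stat %s: %w", runtimePath, err)
	}
	return runtimeStamp{
		Dev:      uint64(st.Dev),
		Ino:      uint64(st.Ino),
		MtimeSec: int64(st.Mtim.Sec),
		MtimeNs:  int64(st.Mtim.Nsec),
	}, nil
}

// cachedFeatures is what gets written under the tmpdir, keyed by runtime path.
type cachedFeatures struct {
	Stamp       runtimeStamp `json:"stamp"`
	SupportsRRO bool         `json:"supportsRRO"`
}

// load returns (entry, true) only when a cache file exists for this runtime
// and the binary has not changed since the entry was written.
func load(tmpDir, runtimePath string) (cachedFeatures, bool) {
	var c cachedFeatures
	// A real implementation would probably hash the full runtime path to
	// avoid collisions between runtimes that share a basename.
	key := filepath.Join(tmpDir, "features-"+filepath.Base(runtimePath)+".json")
	data, err := os.ReadFile(key)
	if err != nil || json.Unmarshal(data, &c) != nil {
		return c, false
	}
	cur, err := stampOf(runtimePath)
	if err != nil || cur != c.Stamp {
		return c, false // runtime binary changed (or vanished): re-run `features`
	}
	return c, true
}
```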
> Alternatively, I wonder if we could lazy-initialize this

That is my suggestion 2, using sync.OnceValues().

> ro is a common option, so we will always end up with the price of running $OCI_RUNTIME features on each invocation

How so? Any command that does not start a container will certainly never need it: podman stop, podman rm, podman ps, podman logs, etc. The way I look at it, only a minority of commands will actually need this.

> IMO we could cache the result under the tmpdir using the runtime path as the key, and then use the runtime binary stat (st_dev, st_ino and mtime) information to detect when the cache is out of date.

Yes, that sounds good to me; there is an overhead for stat and reading the file, but compared to the rest we do that is likely not a problem. It should be much better than spawning a new process each time.
Even worse, reading the code again we actually call newConmonOCIRuntime() on all runtimes from containers.conf, so we pay the overhead multiple times for all runtimes, which really doesn't feel right.
> ro is a common option, so we will always end up with the price of running $OCI_RUNTIME features on each invocation

> How so? Any command that does not start a container will certainly never need it: podman stop, podman rm, podman ps, podman logs, etc. The way I look at it, only a minority of commands will actually need this.

I was referring only to the commands that run a container, and that is where we care more about performance. sync.OnceValues() won't work with podman system service as we need to check that the runtime didn't change.
On my system:

```
➜ hyperfine 'crun features'
Benchmark 1: crun features
  Time (mean ± σ):     2.3 ms ± 0.6 ms    [User: 0.8 ms, System: 1.6 ms]
  Range (min … max):   1.6 ms … 6.2 ms    553 runs
```

A stat() is much faster than that. If we really care about performance we could even consider statx() and query only the information we need. Most of the price will come from dealing with json anyway, so maybe we want to store the information in a different format and store only the bits we care about.
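The statx() variant could be as small as this sketch, which requests only the fields a cache key like the one sketched earlier actually needs (names are illustrative):

```go
package rrocache

import "golang.org/x/sys/unix"

// statxStamp asks the kernel for only the inode and mtime of the runtime
// binary instead of the full stat result.
func statxStamp(path string) (ino uint64, mtimeSec int64, mtimeNsec uint32, err error) {
	var stx unix.Statx_t
	err = unix.Statx(unix.AT_FDCWD, path, unix.AT_STATX_DONT_SYNC,
		unix.STATX_INO|unix.STATX_MTIME, &stx)
	if err != nil {
		return 0, 0, 0, err
	}
	return stx.Ino, stx.Mtime.Sec, stx.Mtime.Nsec, nil
}
```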
> Most of the price will come from dealing with json anyway, so maybe we want to store the information in a different format and store only the bits we care about.

Sounds like a good idea.
(sorry for the multi-week delay in response)

> wrap this into a sync.OnceValues() function so it is only called once and only when needed.

I prefer lazy-initializing the runtime features over the other two options suggested by @Luap99 as well. As for caching the output under the tmpdir, I'm under the assumption, as @giuseppe pointed out, that it's to avoid a situation where the runtime changes. It shouldn't make too much of a dent in terms of performance.
```go
errKernelDoesNotSupportRRO = errors.New("kernel does not support recursive readonly mount option `rro`")
errRuntimeDoesNotSupportRRO = errors.New("runtime does not support recursive readonly mount option `rro`")

kernelSupportsRROOnce sync.Once
```
new code should use sync.OnceValue(s) instead
```
@@ -194,6 +195,11 @@ func (r *MissingRuntime) SupportsKVM() bool {
	return false
}

// Features returns nil since this is a missing runtime
func (r *MissingRuntime) Features() *features.Features {
	return nil
```
This should still return an error, so I recommend (*features.Features, error) - the missing runtime can have methods called on it and we want to avoid null pointer panics
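Roughly something like the following sketch; the error text is illustrative, and it assumes MissingRuntime keeps a name field (the real code would likely reuse whatever sentinel error the other MissingRuntime methods return):

```go
// Features returns an error rather than a nil pointer that callers might
// dereference, since the runtime binary is missing.
func (r *MissingRuntime) Features() (*features.Features, error) {
	return nil, fmt.Errorf("runtime %s is missing, cannot query features", r.name)
}
```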
```
@@ -374,6 +374,27 @@ func (c *Container) generateSpec(ctx context.Context) (s *spec.Spec, cleanupFunc
	// Podman decided for --no-dereference as many
	// bin-utils tools (e..g, touch, chown, cp) do.
	options = append(options, "copy-symlink")
// TODO: this also ends up checking non-user mounts
case "ro", "rro":
```
Why check `rro` here? It seems like we should always pass it along unmodified, even if unsupported - if the user explicitly requested `rro` and we silently downgrade, that seems like a problem.
Yes, `rro` must be passed as it is.
Also we need to have a way to restore the previous behavior. If a user really wants a `ro`, there is no way to achieve it when we always force `ro` -> `rro`.
> Why check `rro` here? It seems like we should always pass it along unmodified, even if unsupported - if the user explicitly requested `rro` and we silently downgrade, that seems like a problem.

Downgrade shouldn't happen. The intent of the `rro` check here is to upgrade `ro` to `rro` when supported, and to error out when a user explicitly requests `rro` and it isn't supported.
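In other words, something along these lines, where `supportsRRO` stands for the combined kernel-plus-runtime check from this PR (the helper name is illustrative, not the PR's actual code):

```go
package rrosketch

import "errors"

// resolveReadOnlyOption sketches the intended behaviour: an explicit "rro"
// errors out when unsupported, while a plain "ro" is upgraded to "rro" only
// when both the kernel and the runtime support it.
func resolveReadOnlyOption(requested string, supportsRRO bool) (string, error) {
	switch requested {
	case "rro":
		if !supportsRRO {
			return "", errors.New("recursive readonly mount option `rro` is not supported")
		}
		return "rro", nil
	case "ro":
		if supportsRRO {
			return "rro", nil // default to the recursive variant when we can
		}
		return "ro", nil
	default:
		return requested, nil
	}
}
```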
> Also we need to have a way to restore the previous behavior. If a user really wants a `ro`, there is no way to achieve it when we always force `ro` -> `rro`.

But why? I can understand it for backwards compat, but that is wrong from a security point of view. I don't know how bad the impact would be when we roll out this change, but we could introduce a non-recursive option (`norro`) as part of the transition until the buggy `ro` behaviour fades out. Wdyt?
The cockpit-podman failure is interesting and relevant.
Another thing I just thought of: this does a kernel feature check, but is that actually correct? We might have VM-based runtimes such as krun; with VM runtimes the mount options would be applied in the VM kernel, not on the host, right? Not sure if this is a real concern.
```go
tmpMnt, err := os.MkdirTemp("", "podman-detect-rro")
if err != nil {
	return fmt.Errorf("failed to create a temp directory: %w", err)
}
for {
	err = unix.Mount("", tmpMnt, "tmpfs", 0, "")
	if !errors.Is(err, unix.EINTR) {
		break
	}
}
if err != nil {
	return fmt.Errorf("failed to mount tmpfs on %q: %w", tmpMnt, err)
}
defer func() {
	var umErr error
	for {
		umErr = unix.Unmount(tmpMnt, 0)
		if !errors.Is(umErr, unix.EINTR) {
			break
		}
	}
	if umErr != nil {
		logrus.Errorf("Failed to unmount %q: %v", tmpMnt, umErr)
	}
}()
```
please use open_tree or fsopen so we don't have to create a mount that could be leaked to the host.
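A sketch of what an open_tree-based probe could look like, using the wrappers and constants from golang.org/x/sys/unix; nothing is ever attached to the host mount table, so nothing can leak. This is illustrative only, not the PR's implementation:

```go
package rrosketch

import "golang.org/x/sys/unix"

// kernelSupportsRRO probes for recursive read-only support without creating a
// mount that could leak: OPEN_TREE_CLONE returns a detached copy of an
// existing mount, and mount_setattr with AT_RECURSIVE is attempted on it.
func kernelSupportsRRO() (bool, error) {
	fd, err := unix.OpenTree(unix.AT_FDCWD, "/",
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		// open_tree(2) needs kernel >= 5.2; if it is unavailable,
		// recursive read-only certainly is as well.
		return false, nil
	}
	defer unix.Close(fd)

	attr := &unix.MountAttr{Attr_set: unix.MOUNT_ATTR_RDONLY}
	if err := unix.MountSetattr(fd, "", unix.AT_EMPTY_PATH|unix.AT_RECURSIVE, attr); err != nil {
		// ENOSYS/EINVAL here means mount_setattr(2) (kernel >= 5.12) is missing.
		return false, nil
	}
	return true, nil
}
```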
```go
kernelSupportsRROOnce.Do(func() {
	errKernelDoesNotSupportRRO = fn()
})
return errKernelDoesNotSupportRRO
```
You can simplify this by using sync.OnceFunc.
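For reference, the sync.OnceValue form could look roughly like this; `probeKernelRRO` is a placeholder for whatever probe the PR ends up using, and sync.OnceFunc would look the same if the error did not need to be returned:

```go
package rrosketch

import "sync"

// probeKernelRRO stands for the actual kernel check (e.g. an open_tree based
// probe); the name is only for illustration.
func probeKernelRRO() error { return nil }

// kernelRROErr replaces the package-level sync.Once plus error variable:
// sync.OnceValue (Go 1.21+) runs the probe on first use and caches the error.
//
// Callers then simply do:
//	if err := kernelRROErr(); err != nil { ... }
var kernelRROErr = sync.OnceValue(probeKernelRRO)
```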
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.