-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-1710: Use mount to apply SELinux label to volumes #3172
Conversation
b4239ca
to
2aae0a6
Compare
Propose usage of "mount -o context=XYZ" to speed up (remove) recursive re-labeling of volumes during pod startup.
2aae0a6
to
edff64b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The KEP states runtime changes as a non-goal and calls out fall-back to default behavior (container-runtime driven re-labelling) when selinux labels are not specified as part of pod security context. Can any references/notes be added to any existing infra/system (say that uses a mutating webhook) for managing/applying selinux labels for pods in a safe cluster-wide fashion so that all stateful pods in a cluster can take advantage of this? A couple of other comments inline but overall, this looks great!
e71b9d8
to
437a681
Compare
437a681
to
77562fd
Compare
* kubelet does not pass any special SELinux option to the container runtime (explicitly, no `:Z`). | ||
Files on the volume already have the right SELinux context, no recursive relabeling happens there. | ||
* When a `PodSecurityContext` specifies incomplete SELinux label (i.e. omits `SELinuxOptions.User`, `.Role` or `.Type`), kubelet fills the blanks from the system defaults provided by [ContainerLabels() from go-selinux bindings](https://github.com/opencontainers/selinux/blob/621ca21a5218df44259bf5d7a6ee70d720a661d5/go-selinux/selinux_linux.go#L770). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker - but for a casual reader of this KEP, can we add some examples of how this API will be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a link to user story 2, which has an example.
* Second mount of the same volume fails with `mount point not mounted or bad option` when the `-o context=` does not match the first mount. | ||
* A bind mount inherits `-o context=XYZ` option from the original mount (if it was set there). | ||
* `/bin/mount` process must detect that the kernel supports SELinux. [It does so](https://github.com/util-linux/util-linux/blob/441f9b9303d015f1777aec7168807d58feacca31/libmount/src/context_mount.c#L291) by checking that both `/etc/selinux/config` file exists and `/sys/fs/selinux` is a mount point with "selinuxfs" type. | ||
* `/bind/mount` throws away the SELinux mount options otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some error detection or something else when mount
is silently throwing away selinux mount options? I think that can happen if driver pod is misconfigured or something else iirc? (such as /sys
is not exposed inside driver pod).
Also - what happens when mount is silently throwing the mount options? Do we fallback to old behaviour or try to run pod without any selinux labels (via mount or otherwise?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no indication that /bin/mount
drops SELinux mount options. See util-linux code. It's quite unfortunate.
Do we fallback to old behaviour or try to run pod without any selinux labels (via mount or otherwise?)
That's a good question. For CSI, the driver did advertise that it supports -o context
. so it's driver's fault if it did not apply the context. We can't check mount options in kubelet, because not all CSI drivers mount or they may mount in a weird way. We could check context of the root of the volume and fail, if it has unexpected context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could check context of the root of the volume and fail, if it has unexpected context.
I like this option if we can manage it. Can we add this detail to the KEP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note to detailed design.
* Kubelet's VolumeManager needs to track which SELinux label should get a volume in global mount (to call `MountDevice()` with the right mount options). | ||
* It must call `UnmountDevice()` even when another pod wants to re-use a mounted volume, but it has a different SELinux context. | ||
* While tracking SELinux labels of volumes, it can emit metrics suggested below. | ||
* Volume plugins will get SELinux context as a new parameter of `MountDevice` and `SetUp`/`SetupAt` calls (resp. as a new field in `DeviceMounterArgs` / `MounterArgs`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning to track selinux context for each setup call also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setup()
is called when (bind-)mounting a volume to pod directory. This mount is not shared among pods, I think it's safe not to track it.
@@ -403,7 +389,7 @@ _This section must be completed when targeting alpha to a release._ | |||
|
|||
* **How can this feature be enabled / disabled in a live cluster?** | |||
- [X] Feature gate (also fill in values in `kep.yaml`) | |||
- Feature gate name: SELinuxRelabelPolicy | |||
- Feature gate name: `SELinuxMountReadWriteOncePod` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the final feature gate or while the feature works only for ReadWriteOncePod
pods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's the final feature gate, we need another one for all volumes. It's called SELinuxMount
below.
// This is typical for volumes that represent subdirectories of a bigger shared filesystem. | ||
// | ||
// Default is "false". | ||
SELinuxMountSupported *bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a setting at CSI driver level. Do we need something that has a finer granularity? For example, what if a CSI driver supports both block and file share volumes and it is true for block but false for file share volumes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need CSI change for this - the driver would need to advertise which volume supports SELinux mount options and which does not. Kubelet cannot not know that. I tried to avoid CSI changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single CSI driver supporting multiple volume types is not well supported today. I think a CSI driver will have to filter out the SELinux mount option in the NodeStage/Publish call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we expect CSI drivers to filter SELinux options, then I'd prefer to have an explicit field in CSI NodeStage/NodePublish for that, so they know what to filter.
Do we really expect that? I am afraid that the most CSI drivers won't care about SELinux at all, as it's used in minority of Kubernetes distros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the onprem world there are more users that care about running on RHEL with SELinux. Maybe leave this as a open item to resolve before beta?
|
||
* Kubernetes checks that `SELinuxRelabelPolicy` field can be used in a pod only when at least `SELinuxOptions.Level` is set. | ||
* *Before this KEP*: Pod A suddenly starts getting "permission denied" errors when accessing files on the volume, because the container runtime re-labeled all files on it with label Y when starting pod B. Pod B will start just fine and can access the volume. | ||
* *As proposed in this KEP*: Pod B won't even start, because the volume is already mounted with `-o context=X`. When kubelet tries to mount the same volume with `-o context=Y`, this mount fails. The Pod B with be `ContainerCreating` until Pod A is deleted and its volumes unmounted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we have events/messages indicating why Pod B stays in ContainerCreating?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message is a generic mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error
. /bin/mount
does not tell which mount option is the bad one.
Added a note about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message is a generic mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error. /bin/mount does not tell which mount option is the bad one.
Can external logic on a node look at the mount that is failing and determine if it is being mounted twice with different -o context=foo
options? That message is really hard for laymen to interpret and doesn't give much hint to the pod author.
This will come up in the beta PRR review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some wording around - kubelet will track its mounts and with what context they were set up, so it may throw a nice error like volume X is already used by pod Y with another SELinux context
.
When a Pod specifies incomplete SELinux label (i.e. omits `SELinuxOptions.User`, `.Role` or `.Type`), kubelet fills the blanks from the system defaults provided by [ContainerLabels() from go-selinux bindings](https://github.com/opencontainers/selinux/blob/621ca21a5218df44259bf5d7a6ee70d720a661d5/go-selinux/selinux_linux.go#L770). | ||
A special case of the previous example is when two pods with different SELinux contexts use the same volume, but different subpaths of it. | ||
The container runtime then re-labels only these subpaths and as long as the subpaths are different, both pods can run today. | ||
**This will not be possible with this KEP** - while the container runtime operates on subpaths, kubelet always mounts the whole volume. The first mount with `-o context=X` will succeed, the second with `-o context=Y` will fail. The second Pod will be `ContainerCreating` as described above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we have events/messages indicating why Pod B stays in ContainerCreating in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, mount will fail, most probably with a very generic message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a better message above when kubelet knows that the SELinux contexts are wrong.
* kubelet does not pass any special SELinux option to the container runtime (explicitly, no `:Z`). | ||
Files on the volume already have the right SELinux context, no recursive relabeling happens there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this split responsibility about who should relabel what can be a concern. but from a container runtime perspective this is fine.
68637d6
to
ba0ed80
Compare
I squashed old commits, added a new commit with volume reconstruction / upgrade. Marked as implementable. |
90ec079
to
ec30aca
Compare
ec30aca
to
14c8b48
Compare
/hold cancel |
The issues related to kubelet restart are solvable (although not very straightforward) and hence the kep looks good to me. /lgtm |
* **Is the rollout accompanied by any deprecations and/or removals of features, | ||
APIs, fields of API types, flags, etc.?** | ||
Even if applying deprecation policies, they may still surprise some users. | ||
|
||
**YES!** See [Behavioral changes](#behavioral-changes) and [Implementation phases](implementation-phases) above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new failure mode looks slightly better to my eyes, but it still looks difficult to debug as a pod author.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I improved the error messages where kubelet can know what's wrong.
|
||
* **What specific metrics should inform a rollback?** | ||
|
||
TBD: We propose a metric above, file its name here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh. forgot to fill this in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be required for Alpha. We'll figure it out later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor nits/clarifications
|
||
### Examples | ||
1) Kubelet knows that the in-tree AWS CSI plugin supports mounting with `-o context`. The mount option is then used (if pod context is known) and the container runtime does not relabel the volume. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in-tree AWS CSI plugin => in-tree EBS plugin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
* Each volume plugin can choose to use the mount option `-o context=` (e.g. when `CSIDriver.SELinuxRelabelPolicy` is `true`) or ignore it (e.g. in-tree volume plugins for shared filesystems or when `CSIDriver.SELinuxRelabelPolicy` is `false` or `nil`). | ||
* Each volume plugin then returns `SupportsSELinux` from `GetAttributes()` call, depending on if it wants the container runtime to relabel the volume (`true`) or not (`false`; the volume was already mounted with the right label or it does not support SELinux at all). | ||
* When a CSI driver announces `SELinuxMountSupported: true`, kubelet will check that `-o context=X` was correctly applied after `NodePublish()`. | ||
It will report error when the context in `/proc/mounts` does not match the expected value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will an error here (due to mismatch in /proc/mount wrt context) lead to the overall NodePublish being considered failed leading to retries with backoffs until the /proc/mount entry matches expectations? Or will this be a warning event that does not block pod sandbox creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial idea that it will be a SetUp()
error, i.e. kubelet will retry SetUp()
again, until the driver fixes it. The pod won't start.
Question is, if the driver can fix it, if it mounted the volume wrong at the first attempt - any retry probably won't help much here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewording the bullet to mention the consequence - the pod won't ever start, the CSI driver is faulty.
@@ -403,7 +403,7 @@ _This section must be completed when targeting alpha to a release._ | |||
|
|||
* **How can this feature be enabled / disabled in a live cluster?** | |||
- [X] Feature gate (also fill in values in `kep.yaml`) | |||
- Feature gate name: SELinuxRelabelPolicy | |||
- Feature gate name: `SELinuxMountReadWriteOncePod` | |||
- Components depending on the feature gate: apiserver (API validation only), kubelet | |||
- [ ] Other | |||
- Describe the mechanism: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice this last time. You're missing an answer for this section. i believe you need need a drain and restart of the kubelet to change this setting. Please describe that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice this last time. You're missing an answer for this section. i believe you need need a drain and restart of the kubelet to change this setting. Please describe that.
I see the answer a few questions down.
the monitoring and troubleshooting sections are often the hardest when moving from alpha to beta. I encourage you to consider those flows while developing the alpha. The PRR is complete for alpha. /approve |
…text And other minor edits
Added a commit with last minute edits. |
Can you squash the commits? /lgtm /hold |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, jsafrane, xing-yang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
mount -o context=XYZ
to apply SELinux label to volumesOther comments:
cc @gnufied @xing-yang @jingxu97 @msau42