
KEP-1710: Use mount to apply SELinux label to volumes #3172

Merged
merged 6 commits into from
Feb 2, 2022

Conversation


@jsafrane jsafrane commented Jan 21, 2022

  • One-line PR description: Update existing KEP to use mount -o context=XYZ to apply SELinux label to volumes

Other comments:

  • The existing KEP was turned down during API review. Reworking the KEP to expose fewer knobs to users and do the right thing out of the box. As a tradeoff, the KEP changes existing Kubernetes behavior in a corner case.
  • Targeting alpha in 1.24.

cc @gnufied @xing-yang @jingxu97 @msau42

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 21, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 21, 2022
Propose usage of "mount -o context=XYZ" to speed up (remove) recursive
re-labeling of volumes during pod startup.
@ddebroy ddebroy left a comment

The KEP states runtime changes as a non-goal and calls out falling back to the default behavior (container-runtime-driven relabelling) when SELinux labels are not specified in the pod security context. Can any references/notes be added about existing infra/systems (say, one that uses a mutating webhook) for managing/applying SELinux labels for pods in a safe, cluster-wide fashion, so that all stateful pods in a cluster can take advantage of this? A couple of other comments inline, but overall this looks great!

@jsafrane jsafrane force-pushed the rework-selinux branch 6 times, most recently from e71b9d8 to 437a681 Compare January 24, 2022 16:56
@jsafrane jsafrane changed the title WIP: 1710: Use mount to apply SELinux label to volumes 1710: Use mount to apply SELinux label to volumes Jan 24, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2022
* kubelet does not pass any special SELinux option to the container runtime (explicitly, no `:Z`).
Files on the volume already have the right SELinux context, no recursive relabeling happens there.
* When a `PodSecurityContext` specifies an incomplete SELinux label (i.e. omits `SELinuxOptions.User`, `.Role`, or `.Type`), kubelet fills in the blanks from the system defaults provided by [ContainerLabels() from go-selinux bindings](https://github.com/opencontainers/selinux/blob/621ca21a5218df44259bf5d7a6ee70d720a661d5/go-selinux/selinux_linux.go#L770).
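The fill-the-blanks behavior can be sketched as follows. This is an illustrative sketch, not kubelet's actual code: the default values are hypothetical stand-ins for whatever the node's policy (via `ContainerLabels()` from go-selinux) would return.

```go
package main

import (
	"fmt"
	"strings"
)

// fillSELinuxLabel completes an SELinux label (user:role:type:level) by
// filling empty parts from defaults. The defaults below are hypothetical;
// kubelet takes the real ones from go-selinux's ContainerLabels().
func fillSELinuxLabel(user, role, typ, level string) string {
	if user == "" {
		user = "system_u" // assumed default
	}
	if role == "" {
		role = "system_r" // assumed default
	}
	if typ == "" {
		typ = "container_t" // assumed default
	}
	return strings.Join([]string{user, role, typ, level}, ":")
}

func main() {
	// A pod that sets only SELinuxOptions.Level; kubelet fills the rest.
	fmt.Println(fillSELinuxLabel("", "", "", "s0:c123,c456"))
}
```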

Member

Not a blocker - but for a casual reader of this KEP, can we add some examples of how this API will be used?

Member Author

Added a link to user story 2, which has an example.

* Second mount of the same volume fails with `mount point not mounted or bad option` when the `-o context=` does not match the first mount.
* A bind mount inherits `-o context=XYZ` option from the original mount (if it was set there).
* `/bin/mount` process must detect that the kernel supports SELinux. [It does so](https://github.com/util-linux/util-linux/blob/441f9b9303d015f1777aec7168807d58feacca31/libmount/src/context_mount.c#L291) by checking that both `/etc/selinux/config` file exists and `/sys/fs/selinux` is a mount point with "selinuxfs" type.
* `/bin/mount` throws away the SELinux mount options otherwise.
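The two conditions of that detection can be sketched as follows, with the file-existence check and the mount table passed in as parameters so each condition is explicit. This is a simplified illustration of the util-linux logic linked above, not its actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// selinuxDetected mirrors the util-linux check described above:
// /etc/selinux/config must exist AND /sys/fs/selinux must be mounted with
// filesystem type "selinuxfs". configExists stands in for the file check;
// mounts is /proc/self/mounts-style content.
func selinuxDetected(configExists bool, mounts string) bool {
	if !configExists {
		return false
	}
	for _, line := range strings.Split(mounts, "\n") {
		f := strings.Fields(line)
		// /proc/self/mounts fields: device mountpoint fstype options dump pass
		if len(f) >= 3 && f[1] == "/sys/fs/selinux" && f[2] == "selinuxfs" {
			return true
		}
	}
	return false
}

func main() {
	mounts := "selinuxfs /sys/fs/selinux selinuxfs rw,relatime 0 0"
	fmt.Println(selinuxDetected(true, mounts))  // both checks pass
	fmt.Println(selinuxDetected(false, mounts)) // no /etc/selinux/config
}
```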
Member

Is there some error detection or something else for when mount silently throws away SELinux mount options? I think that can happen if the driver pod is misconfigured or something else, IIRC (such as /sys not being exposed inside the driver pod).

Also - what happens when mount silently throws away the mount options? Do we fall back to the old behaviour or try to run the pod without any SELinux labels (via mount or otherwise)?

Member Author

There is no indication that /bin/mount drops SELinux mount options. See util-linux code. It's quite unfortunate.

Do we fall back to the old behaviour or try to run the pod without any SELinux labels (via mount or otherwise)?

That's a good question. For CSI, the driver did advertise that it supports `-o context`, so it's the driver's fault if it did not apply the context. We can't check mount options in kubelet, because not all CSI drivers mount, or they may mount in a weird way. We could check the context of the root of the volume and fail if it has an unexpected context.

Member

We could check the context of the root of the volume and fail if it has an unexpected context.

I like this option if we can manage it. Can we add this detail to the KEP?

Member Author

Added a note to detailed design.

* Kubelet's VolumeManager needs to track which SELinux label each volume should get in its global mount (to call `MountDevice()` with the right mount options).
* It must call `UnmountDevice()` even when another pod wants to re-use a mounted volume, but it has a different SELinux context.
* While tracking SELinux labels of volumes, it can emit metrics suggested below.
* Volume plugins will get SELinux context as a new parameter of `MountDevice` and `SetUp`/`SetupAt` calls (resp. as a new field in `DeviceMounterArgs` / `MounterArgs`).
Member

Are we planning to track selinux context for each setup call also?

Member Author

`SetUp()` is called when (bind-)mounting a volume into the pod directory. This mount is not shared among pods, so I think it's safe not to track it.

@@ -403,7 +389,7 @@ _This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: SELinuxRelabelPolicy
- Feature gate name: `SELinuxMountReadWriteOncePod`
Member

Is this the final feature gate, or just the one used while the feature works only for ReadWriteOncePod pods?

@jsafrane jsafrane Jan 25, 2022

Yes, it's the final name for this feature gate; we need another one for all volumes, called `SELinuxMount` below.

// This is typical for volumes that represent subdirectories of a bigger shared filesystem.
//
// Default is "false".
SELinuxMountSupported *bool
Contributor

This is a setting at CSI driver level. Do we need something that has a finer granularity? For example, what if a CSI driver supports both block and file share volumes and it is true for block but false for file share volumes?

Member Author

We would need a CSI change for this - the driver would need to advertise which volumes support SELinux mount options and which do not. Kubelet cannot know that. I tried to avoid CSI changes here.

Member

A single CSI driver supporting multiple volume types is not well supported today. I think a CSI driver will have to filter out the SELinux mount option in the NodeStage/Publish call.

Member Author

If we expect CSI drivers to filter SELinux options, then I'd prefer to have an explicit field in CSI NodeStage/NodePublish for that, so they know what to filter.

Do we really expect that? I am afraid that most CSI drivers won't care about SELinux at all, as it's used in a minority of Kubernetes distros.

Member

I think in the on-prem world there are more users that care about running on RHEL with SELinux. Maybe leave this as an open item to resolve before beta?


* Kubernetes checks that `SELinuxRelabelPolicy` field can be used in a pod only when at least `SELinuxOptions.Level` is set.
* *Before this KEP*: Pod A suddenly starts getting "permission denied" errors when accessing files on the volume, because the container runtime re-labeled all files on it with label Y when starting pod B. Pod B will start just fine and can access the volume.
* *As proposed in this KEP*: Pod B won't even start, because the volume is already mounted with `-o context=X`. When kubelet tries to mount the same volume with `-o context=Y`, the mount fails. Pod B will be `ContainerCreating` until Pod A is deleted and its volumes unmounted.
Contributor

Will we have events/messages indicating why Pod B stays in ContainerCreating?

Member Author

The message is a generic `mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error`. `/bin/mount` does not tell which mount option is the bad one.

Added a note about it.

@deads2k deads2k Jan 31, 2022

The message is a generic `mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error`. `/bin/mount` does not tell which mount option is the bad one.

Can external logic on a node look at the mount that is failing and determine if it is being mounted twice with different -o context=foo options? That message is really hard for laymen to interpret and doesn't give much hint to the pod author.

This will come up in the beta PRR review.

Member Author

I added some wording around it - kubelet will track its mounts and with what context they were set up, so it may throw a nice error like `volume X is already used by pod Y with another SELinux context`.

When a Pod specifies an incomplete SELinux label (i.e. omits `SELinuxOptions.User`, `.Role`, or `.Type`), kubelet fills in the blanks from the system defaults provided by [ContainerLabels() from go-selinux bindings](https://github.com/opencontainers/selinux/blob/621ca21a5218df44259bf5d7a6ee70d720a661d5/go-selinux/selinux_linux.go#L770).
A special case of the previous example is when two pods with different SELinux contexts use the same volume, but different subpaths of it.
The container runtime then re-labels only these subpaths and as long as the subpaths are different, both pods can run today.
**This will not be possible with this KEP** - while the container runtime operates on subpaths, kubelet always mounts the whole volume. The first mount with `-o context=X` will succeed, the second with `-o context=Y` will fail. The second Pod will be `ContainerCreating` as described above.
Contributor

Will we have events/messages indicating why Pod B stays in ContainerCreating in this case?

Member Author

Same as above, mount will fail, most probably with a very generic message.

Member Author

Added a better message above when kubelet knows that the SELinux contexts are wrong.

@jsafrane jsafrane changed the title 1710: Use mount to apply SELinux label to volumes KEP-1710: Use mount to apply SELinux label to volumes Jan 25, 2022
Comment on lines +139 to +140
* kubelet does not pass any special SELinux option to the container runtime (explicitly, no `:Z`).
Files on the volume already have the right SELinux context, no recursive relabeling happens there.
Member

I think this split responsibility about who should relabel what can be a concern, but from a container runtime perspective this is fine.

@jsafrane jsafrane force-pushed the rework-selinux branch 2 times, most recently from 68637d6 to ba0ed80 Compare January 26, 2022 14:53
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2022
@jsafrane
Member Author

I squashed old commits, added a new commit with volume reconstruction / upgrade. Marked as implementable.

@gnufied
Member

gnufied commented Jan 31, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2022
@gnufied
Member

gnufied commented Jan 31, 2022

The issues related to kubelet restart are solvable (although not very straightforward), and hence the KEP looks good to me.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2022
* **Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.

**YES!** See [Behavioral changes](#behavioral-changes) and [Implementation phases](#implementation-phases) above.
Contributor

The new failure mode looks slightly better to my eyes, but it still looks difficult to debug as a pod author.

Member Author

I improved the error messages where kubelet can know what's wrong.


* **What specific metrics should inform a rollback?**

TBD: We propose a metric above, file its name here.
Contributor

heh. forgot to fill this in?

Member Author

This should not be required for Alpha. We'll figure it out later.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2022
@ddebroy ddebroy left a comment

A couple of minor nits/clarifications


### Examples
1) Kubelet knows that the in-tree AWS CSI plugin supports mounting with `-o context`. The mount option is then used (if pod context is known) and the container runtime does not relabel the volume.
Member

nit: in-tree AWS CSI plugin => in-tree EBS plugin

Member Author

Fixed

* Each volume plugin can choose to use the mount option `-o context=` (e.g. when `CSIDriver.SELinuxRelabelPolicy` is `true`) or ignore it (e.g. in-tree volume plugins for shared filesystems or when `CSIDriver.SELinuxRelabelPolicy` is `false` or `nil`).
* Each volume plugin then returns `SupportsSELinux` from the `GetAttributes()` call, depending on whether it wants the container runtime to relabel the volume (`true`) or not (`false`; the volume was already mounted with the right label, or it does not support SELinux at all).
* When a CSI driver announces `SELinuxMountSupported: true`, kubelet will check that `-o context=X` was correctly applied after `NodePublish()`.
It will report error when the context in `/proc/mounts` does not match the expected value.
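The verification step could be sketched by parsing `/proc/mounts`-style content for the volume's mount point and extracting the `context=` option. A simplified sketch, not kubelet's actual code: real `/proc/mounts` parsing must also handle octal escapes and commas inside a quoted MLS level, which this skips by using a level without categories:

```go
package main

import (
	"fmt"
	"strings"
)

// mountContext extracts the value of the context= mount option for the
// given mount point from /proc/mounts-style content. Simplified: it splits
// the option field on commas, so it would mis-handle an MLS level that
// itself contains commas (e.g. s0:c1,c2).
func mountContext(mounts, mountPoint string) (string, bool) {
	for _, line := range strings.Split(mounts, "\n") {
		f := strings.Fields(line)
		// fields: device mountpoint fstype options dump pass
		if len(f) < 4 || f[1] != mountPoint {
			continue
		}
		for _, opt := range strings.Split(f[3], ",") {
			if strings.HasPrefix(opt, "context=") {
				return strings.Trim(strings.TrimPrefix(opt, "context="), `"`), true
			}
		}
	}
	return "", false
}

func main() {
	mounts := `/dev/sdb /var/lib/kubelet/vol xfs rw,context="system_u:object_r:container_file_t:s0" 0 0`
	ctx, ok := mountContext(mounts, "/var/lib/kubelet/vol")
	fmt.Println(ok, ctx)
}
```

Kubelet would then compare the extracted context against the one it passed to `NodePublish()` and fail the `SetUp()` on mismatch.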
Member

Will an error here (due to mismatch in /proc/mount wrt context) lead to the overall NodePublish being considered failed leading to retries with backoffs until the /proc/mount entry matches expectations? Or will this be a warning event that does not block pod sandbox creation?

Member Author

My initial idea was that it would be a SetUp() error, i.e. kubelet will retry SetUp() until the driver fixes it. The pod won't start.

The question is whether the driver can fix it; if it mounted the volume wrong on the first attempt, any retry probably won't help much here.

Member Author

Reworded the bullet to mention the consequence - the pod won't ever start; the CSI driver is faulty.

@@ -403,7 +403,7 @@ _This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: SELinuxRelabelPolicy
- Feature gate name: `SELinuxMountReadWriteOncePod`
- Components depending on the feature gate: apiserver (API validation only), kubelet
- [ ] Other
- Describe the mechanism:
Contributor

I didn't notice this last time. You're missing an answer for this section. I believe you need a drain and restart of the kubelet to change this setting. Please describe that.

Contributor

I didn't notice this last time. You're missing an answer for this section. I believe you need a drain and restart of the kubelet to change this setting. Please describe that.

I see the answer a few questions down.

@deads2k
Contributor

deads2k commented Feb 1, 2022

The monitoring and troubleshooting sections are often the hardest when moving from alpha to beta. I encourage you to consider those flows while developing the alpha.

The PRR is complete for alpha.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2022
@jsafrane
Member Author

jsafrane commented Feb 2, 2022

Added a commit with last minute edits.

@xing-yang
Contributor

Can you squash the commits?

/lgtm
/approve

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jsafrane, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xing-yang
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit a40b53b into kubernetes:master Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 2, 2022